Abstract

We consider the framework of Independent Component Analysis (ICA) for the case where the
independent sources and their linear mixtures all reside in a Galois field of prime order P. Similarities
and differences from the classical ICA framework (over the real field) are explored. We show that
a necessary and sufficient identifiability condition is that none of the sources should have a uniform
distribution. We also show that pairwise independence of the mixtures implies their full mutual
independence (namely a non-mixing condition) in the binary (P = 2) and ternary (P = 3) cases,
but not necessarily in higher order (P > 3) cases. We propose two different iterative separation (or
identification) algorithms: one is based on sequential identification of the smallest-entropy linear
combinations of the mixtures, and is shown to be equivariant with respect to the mixing matrix; the
other is based on sequential minimization of the pairwise mutual information measures. We provide
some basic performance analysis for the binary (P = 2) case, supplemented by simulation results for
higher orders, demonstrating advantages and disadvantages of the proposed separation approaches.



I. Introduction 

Independent Component Analysis (ICA, see, e.g., 121, ||3l, lH for some of the fundamental princi- 
ples) addresses the recovery of unobserved, statistically independent source signals from their observed 
linear (and invertible) mixtures, without prior knowledge of the mixing matrix or of the sources' 
statistics. Classically, the ICA framework assumes that the sources and the mixing (hence, also the 
observations) are defined over the field of real-valued numbers $\mathbb{R}$, with some exceptions (e.g., iQ)
that assume the field of complex-valued numbers $\mathbb{C}$. It might be interesting, though, at least from a
theoretical point of view, to explore the applicability of ICA principles in other algebraic fields. 

In this work we consider ICA over Galois Fields of prime order P, denoted GF(P), such that
the sources and the elements of the mixing matrix can all take only a finite number of values, defined
by the set {0, 1, ..., P - 1} (or by some offset, isomorphic version thereof), and where addition and
multiplication are applied modulo P, thereby returning values in the same set.

The author is with the Department of Electrical Engineering - Systems, Tel-Aviv University, Tel-Aviv, Israel. Some part
of this work was presented at ICA'07 [1].

For example, in the field GF(2) of binary numbers {0, 1}, addition is equivalent to the
"Exclusive Or" (XOR) operation, denoted $z = x \oplus y$ (where $z$ equals 1 if $x \neq y$ and equals 0
otherwise). Multiplication (either by 0 or by 1) is defined in the "usual" way in this case.

In the field GF(3) of ternary numbers {0, 1, 2}, where addition and multiplication are defined
modulo 3 (similarly denoted $z = x \oplus y$), it is sometimes more convenient to consider the offset
group {0, 1, -1}. In this group, multiplication can still be defined in the "usual" way, since ordinary
multiplication of any two numbers in this group returns a number in the group. The two
sets {0, 1, 2} and {0, 1, -1} are isomorphic in GF(3), and will be used interchangeably in the sequel.
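The field arithmetic described above can be illustrated with a minimal Python sketch. The helper names `gf_add`, `gf_mul` and `to_offset` are hypothetical and introduced here only for illustration; the checks confirm that GF(2) addition coincides with XOR and that ordinary multiplication on the offset group {0, 1, -1} agrees with modulo-3 multiplication on {0, 1, 2}.

```python
def gf_add(x, y, p):
    """Addition over GF(p): ordinary addition reduced modulo p."""
    return (x + y) % p


def gf_mul(x, y, p):
    """Multiplication over GF(p): ordinary multiplication reduced modulo p."""
    return (x * y) % p


def to_offset(x):
    """Map a GF(3) element in {0, 1, 2} to its representative in {0, 1, -1}."""
    return x - 3 if x == 2 else x


# In GF(2), addition coincides with the XOR operation.
assert all(gf_add(x, y, 2) == (x ^ y) for x in (0, 1) for y in (0, 1))

# In GF(3), ordinary multiplication on {0, 1, -1} matches modulo-3
# multiplication on {0, 1, 2} under the offset mapping.
for x in range(3):
    for y in range(3):
        assert to_offset(gf_mul(x, y, 3)) == to_offset(x) * to_offset(y)
```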

A fundamental difference, at least in the context of ICA, between random variables over $\mathbb{R}$ and
over GF(P) is the following: Let u and v be two statistically independent, non-degenerate (namely,
non-deterministic) random variables, and consider the random variable w, given by any non-trivial
linear combination of u and v. In $\mathbb{R}$, v and w cannot be statistically independent (they are obviously
correlated), no matter how u and v are distributed. However, as we shall show in Section III, in GF(P)
v and w may indeed be statistically independent, and this happens if and only if the distribution of
u is uniform (taking each of the P values with equal probabilities).

In a sense, this property tags the uniform distribution as the "problematic" distribution in ICA over 
GF(P), reminiscent of the role taken by the Gaussian distribution in ICA over $\mathbb{R}$. Note that these
two distributions share additional related properties in their respective fields: They are both (under 
mild regularity conditions) limit-distributions of an infinite sum of independent random variables; 
and they are both "maximum entropy" distributions (subject to a variance constraint for the Gaussian 
distribution in $\mathbb{R}$). So, loosely stated, in the same way that a linear combination of independent
random variables over $\mathbb{R}$ tends to be "more Gaussian", a linear combination of independent random
variables over GF(P) tends to be "more uniform". 

Nevertheless, there still remain some essential differences between the roles of these distributions 
in the respective contexts. For example, in GF(P), if (at least) one of the random variables in the linear
combination of independent variables is uniform, the resulting distribution would be exactly uniform
as well, no matter how the other random variables are distributed. Evidently, this property does not
hold for Gaussian distributions over $\mathbb{R}$.

Therefore, as we shall show, these properties lead to an identifiability condition for ICA over 
GF(P), which is reminiscent of, but certainly not equivalent to, a well-known identifiability condition 
over $\mathbb{R}$. More specifically, the identifiability condition for ICA over $\mathbb{R}$ requires that not more than
one of the sources be Gaussian. Our identifiability condition for ICA over GF(P) requires that none
of the sources be uniform. The key to this identifiability condition is the property that the entropy
of any linear combination of statistically independent random variables over GF(P) is larger than
the entropy of the largest-entropy component, as long as this component is not uniform. Therefore, 
if none of the sources is uniform, then, at least conceptually, a possible separation approach is to 
look for the (inverse) linear transformation, which minimizes the empirical marginal entropies of 
the resulting linear combinations. However, since an exhaustive search for this transformation would 
often be prohibitively computationally expensive, we shall propose an alternative, computationally 
cheaper method for entropy-based identification. 

Another possible, somewhat different separation approach is the following. One of the key observations
in ICA over $\mathbb{R}$ is that, under the identifiability condition and due to the Darmois-Skitovich
theorem (e.g., [6], p. 218), pairwise independence of the mixtures implies their full mutual independence,
which in turn implies a non-mixing condition (namely, separation). Interestingly, we shall show
that our general identifiability condition is necessary and sufficient to guarantee a similar property 
for ICA over GF(2) and GF(3), but is generally insufficient for this property to hold in GF(P) for 
P > 3. Thus, another possible identification approach (in GF(2) and in GF(3) only) is to look for an 
invertible linear transformation of the observations, which makes the resulting signals "as empirically 
pairwise-independent as possible" - a property which is easier to quantify and measure than full 
independence (being quadratic, rather than exponential, in the number of sources K). Again - since 
an exhaustive search is often not feasible, we shall propose a different, sequential method for this 
approach. 

A common assumption in the design and analysis of classical ICA methods over $\mathbb{R}$ is that each of
the sources has an independent, identically distributed (iid) time-structure. Our discussion in this paper 
would be similarly restricted along the same line. We note, however, that in equivalence to methods 
which exploit possibly different temporal structures (e.g., spectral diversity Q, non-stationarity IH, 
etc.) over $\mathbb{R}$, similar extensions of our results would be possible in similar cases over GF(P). However,
we defer the exploration of such cases to future work. 

The paper is structured as follows. In the next section we review some fundamental properties of
random variables and random vectors in GF(P), which will be useful in subsequent derivations. In
Section III we outline the problem formulation and present our general identifiability condition. In
Section IV we explore the relation between pairwise independence and full independence, showing
that in an invertible linear mixture, the former implies the latter in GF(2) and in GF(3), but not
necessarily in Galois fields of higher orders. We then proceed to propose two different separation
algorithms in Section V. A rudimentary performance analysis for the simple binary case (P = 2) is
provided in Section VI, supplemented with supporting simulation results which extend to larger-scale
scenarios. Our work is summarized with concluding remarks in Section VII.

We shall denote addition, subtraction and multiplication over GF(P) (namely, modulo P) by $\oplus$, $\ominus$
and $\otimes$, respectively, with multiplication preceding addition and subtraction in the order of operations.
Vector multiplication will be denoted by $\circ$, such that if $\mathbf{a} = [a_1 \cdots a_K]^T$ and $\mathbf{x} = [x_1 \cdots x_K]^T$,
then

$$\mathbf{a}^T \circ \mathbf{x} = a_1 \otimes x_1 \oplus a_2 \otimes x_2 \oplus \cdots \oplus a_K \otimes x_K. \tag{1}$$

Similarly, if $\mathbf{A}$ is an $L \times K$ matrix in GF(P), its product with $\mathbf{x}$ is denoted $\mathbf{A} \circ \mathbf{x}$, an $L \times 1$ vector
whose elements are the products of the respective rows of $\mathbf{A}$ with $\mathbf{x}$.
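The products in (1) can be sketched in a few lines of Python. The function names `gf_dot` and `gf_matvec` are hypothetical, and matrices are assumed to be stored as lists of rows; reducing the ordinary integer sum modulo P once at the end gives the same result as reducing after every operation.

```python
def gf_dot(a, x, p):
    """Inner product of two equal-length vectors over GF(p), as in (1)."""
    return sum(ai * xi for ai, xi in zip(a, x)) % p


def gf_matvec(A, x, p):
    """Product of an L x K matrix (list of rows) with a K-vector over GF(p)."""
    return [gf_dot(row, x, p) for row in A]


# Example in GF(3): each output element is a row of A times x, modulo 3.
A = [[1, 2],
     [0, 1]]
x = [2, 2]
y = gf_matvec(A, x, 3)
```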

II. Characterization of random variables and random vectors in GF(P) 

We begin by briefly outlining some of the basic essential properties and definitions of our notations 
for random variables and random vectors in GF(P), which we shall use in the sequel. 

A random variable $u$ in GF(P) is characterized by a discrete probability distribution, fully described
by a vector $\mathbf{p}_u = [p_u(0)\ p_u(1) \cdots p_u(P-1)]^T \in \mathbb{R}^P$, whose elements $p_u(m)$ are $\Pr\{u = m\}$,
the probabilities of $u$ taking the values $m \in \{0, \ldots, P-1\}$. Evidently, all the elements of $\mathbf{p}_u$ are
non-negative and their sum equals 1. We shall refer to $\mathbf{p}_u$ as the probability vector of $u$. The entropy
of $u$ is given by

$$H(u) = -\sum_{m=0}^{P-1} p_u(m) \log p_u(m). \tag{2}$$

By maximizing with respect to $\mathbf{p}_u$, it is easy to show that among all random variables in GF(P),
the uniform random variable (taking all values in GF(P) with equal probability $1/P$) has the largest
entropy, given by $\log P$. Note that it is convenient to use a base-$P$ logarithm $\log_P$ (rather than the
more commonly-used $\log_2$) in this context, such that the entropies of all (scalar) random variables in
GF(P) are confined to [0, 1]. Note, in addition, that since multiplication by a nonzero constant over GF(P)
is bijective, the entropy of a random variable in GF(P) is invariant under such multiplication (which
merely re-arranges the terms in the sum (2)).
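As a quick numerical illustration of (2) with the base-P logarithm, the following sketch computes the entropy of a probability vector and checks its invariance under multiplication by a nonzero constant. The function names `entropy` and `scale` and the example probability vector are assumptions made for this illustration only.

```python
import math


def entropy(p_vec):
    """Entropy (2) of a probability vector over GF(P), in base-P units."""
    P = len(p_vec)
    return -sum(q * math.log(q, P) for q in p_vec if q > 0)


def scale(p_vec, c):
    """Probability vector of c * u (mod P) for a nonzero constant c."""
    P = len(p_vec)
    out = [0.0] * P
    for m, q in enumerate(p_vec):
        out[(c * m) % P] += q  # the value m is mapped to c*m (mod P)
    return out


p_u = [0.5, 0.1, 0.1, 0.2, 0.1]   # an arbitrary non-uniform example in GF(5)
h_u = entropy(p_u)                # lies strictly between 0 and 1
h_scaled = entropy(scale(p_u, 3)) # same terms, re-arranged
h_uniform = entropy([0.2] * 5)    # the maximum, log_P(P) = 1
```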

The characteristic vector of $u$ is denoted $\tilde{\mathbf{p}}_u = [\tilde{p}_u(0)\ \tilde{p}_u(1) \cdots \tilde{p}_u(P-1)]^T \in \mathbb{C}^P$, and its
elements are given by the discrete Fourier transform (DFT) of the elements of $\mathbf{p}_u$:

$$\tilde{p}_u(n) = E[W_P^{nu}] = \sum_{m=0}^{P-1} p_u(m) W_P^{mn}, \qquad n = 0, \ldots, P-1, \tag{3}$$

where the "twiddle factor" $W_P$ is defined as $W_P = e^{-j2\pi/P}$ (note that the modulo-$P$ operation is
inherently present in the exponential part, so $W_P^{mn}$ is equivalent to $W_P^{m \otimes n}$). Like the probability
vector $\mathbf{p}_u$, the characteristic vector $\tilde{\mathbf{p}}_u$ provides full statistical characterization of the random variable
$u$, since $\mathbf{p}_u$ can be directly obtained from $\tilde{\mathbf{p}}_u$ using the inverse DFT.
The following basic properties of $\tilde{\mathbf{p}}_u$ can be easily obtained:
P1) $\tilde{p}_u(0) = 1$;
P2) Since $\mathbf{p}_u$ is real-valued, $\tilde{p}_u(n) = \tilde{p}_u^*(P-n)$ (where the superscript $*$ denotes the complex-conjugate);
P3) $u$ is uniform (namely, $p_u(m) = 1/P\ \forall m$) $\Leftrightarrow$ $\tilde{p}_u(n) = 0\ \forall n \neq 0$;
P4) $u$ is degenerate (namely, $p_u(M) = 1$ for some $M$) $\Leftrightarrow$ $\tilde{p}_u(n) = W_P^{Mn}\ \forall n$;
P5) $|\tilde{p}_u(n)| \leq 1\ \forall n$, where for $n \neq 0$ equality holds if and only if (iff) $u$ is degenerate.
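Note that in the particular cases of GF(2) and GF(3) we have the following simplifications:

• In GF(2), the only free parameter in $\tilde{\mathbf{p}}_u \in \mathbb{R}^2$ is $\tilde{p}_u(1)$, to which we shall refer as

$$\theta_u = \tilde{p}_u(1) = p_u(0) - p_u(1) = 1 - 2p_u(1). \tag{4}$$

Thus $\tilde{\mathbf{p}}_u = [1\ \theta_u]^T$;

• In GF(3), there is also a single (yet complex-valued) free parameter in $\tilde{\mathbf{p}}_u \in \mathbb{C}^3$, to which we
shall refer as

$$\xi_u = \tilde{p}_u(1) = p_u(0) + p_u(1)W_3^1 + p_u(2)W_3^2 = 1 - \tfrac{3}{2}\big(p_u(1) + p_u(2)\big) + j\tfrac{\sqrt{3}}{2}\big(p_u(2) - p_u(1)\big). \tag{5}$$

Thus $\tilde{\mathbf{p}}_u = [1\ \xi_u\ \xi_u^*]^T$.
Note also that $\theta_u = E[W_2^u] = E[(-1)^u]$ and $\xi_u = E[W_3^u]$.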
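The characteristic vector in (3) and the special cases (4) and (5) can be checked numerically with a direct (non-fast) DFT. This is only a sketch: the function name `char_vector` and the example probabilities are assumptions, and the twiddle factor is taken with a negative exponent, $W_P = e^{-j2\pi/P}$.

```python
import cmath


def char_vector(p_vec):
    """Characteristic vector (3): direct DFT of a probability vector."""
    P = len(p_vec)
    W = cmath.exp(-2j * cmath.pi / P)  # twiddle factor W_P
    return [sum(p_vec[m] * W ** ((m * n) % P) for m in range(P))
            for n in range(P)]


# GF(2): the single free parameter theta_u = 1 - 2 p_u(1), as in (4).
theta_u = char_vector([0.7, 0.3])[1]

# GF(3): the single complex free parameter xi_u, as in (5).
p_u = [0.5, 0.3, 0.2]
pt_u = char_vector(p_u)
xi_u = pt_u[1]
xi_formula = (1 - 1.5 * (p_u[1] + p_u[2])
              + 1j * (3 ** 0.5 / 2) * (p_u[2] - p_u[1]))

# Property P3: a uniform variable has zeros at all n != 0.
pt_uniform = char_vector([0.2] * 5)
```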

For two random variables $u$ and $v$ in GF(P), the joint statistics are completely described by the
joint probabilities matrix $\mathbf{P}_{u,v} \in \mathbb{R}^{P \times P}$, whose elements are $P_{u,v}(m,n) = \Pr\{u = m, v = n\}$,
$m, n \in \{0, \ldots, P-1\}$. The joint entropy of $u$ and $v$ is given by

$$H(u,v) = -\sum_{m,n=0}^{P-1} P_{u,v}(m,n) \log P_{u,v}(m,n). \tag{6}$$

The random variables $u$ and $v$ are said to be statistically independent iff $\mathbf{P}_{u,v} = \mathbf{p}_u \mathbf{p}_v^T$. By Jensen's
inequality, $H(u,v)$ satisfies $H(u,v) \leq H(u) + H(v)$, with equality iff $u$ and $v$ are statistically independent.
The mutual information between $u$ and $v$ is the difference $I(u,v) = H(u) + H(v) - H(u,v)$,
which is also the (non-negative) Kullback-Leibler divergence between their joint distribution and the
product of their marginal distributions. The smaller their mutual information, the "more statistically
independent" $u$ and $v$ are; $I(u,v)$ vanishes if and only if $u$ and $v$ are statistically independent.
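The mutual information can be computed directly from a joint probabilities matrix in its Kullback-Leibler form. The sketch below assumes the matrix is stored as a list of rows (row index m for u, column index n for v) and uses base-P logarithms; the function name `mutual_information` and the two example matrices are hypothetical.

```python
import math


def mutual_information(P_uv):
    """I(u, v) in base-P units from a P x P joint probabilities matrix."""
    P = len(P_uv)
    p_u = [sum(row) for row in P_uv]                              # marginal of u
    p_v = [sum(P_uv[m][n] for m in range(P)) for n in range(P)]  # marginal of v
    return sum(P_uv[m][n] * math.log(P_uv[m][n] / (p_u[m] * p_v[n]), P)
               for m in range(P) for n in range(P) if P_uv[m][n] > 0)


# Independent binary variables: the joint matrix is an outer product.
independent = [[0.7 * 0.6, 0.7 * 0.4],
               [0.3 * 0.6, 0.3 * 0.4]]

# Fully dependent binary variables (u = v, both uniform).
dependent = [[0.5, 0.0],
             [0.0, 0.5]]

i_indep = mutual_information(independent)  # vanishes up to rounding
i_dep = mutual_information(dependent)      # equals H(u) = 1 in base-2 units
```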
The conditional distribution of $u$ given $v$ is given by $\mathbf{P}_{u|v} \in \mathbb{R}^{P \times P}$ with elements $P_{u|v}(m,n) =
P_{u,v}(m,n)/p_v(n) = \Pr\{u = m | v = n\}$, $m, n = 0, \ldots, P-1$. The conditional entropy is defined as

$$H(u|v) = -\sum_{n=0}^{P-1} p_v(n) \sum_{m=0}^{P-1} P_{u|v}(m,n) \log P_{u|v}(m,n), \tag{7}$$

which can be easily shown to satisfy $H(u|v) = H(u,v) - H(v)$.

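The joint characteristic matrix of $u$ and $v$, denoted $\tilde{\mathbf{P}}_{u,v} \in \mathbb{C}^{P \times P}$, is given by the two-dimensional
DFT (2DFT) of $\mathbf{P}_{u,v}$,

$$\tilde{P}_{u,v}(m,n) = E[W_P^{mu+nv}] = \sum_{k,\ell=0}^{P-1} P_{u,v}(k,\ell) W_P^{mk+n\ell}, \tag{8}$$

and provides an alternative full statistical characterization of $u$ and $v$. In particular, it is straightforward
to show that $\tilde{\mathbf{P}}_{u,v}$ satisfies $\tilde{\mathbf{P}}_{u,v} = \tilde{\mathbf{p}}_u \tilde{\mathbf{p}}_v^T$ iff $u$ and $v$ are statistically independent.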

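For a $K \times 1$ random vector $\mathbf{u}$ whose elements $u_1, \ldots, u_K$ are random variables in GF(P), the joint
statistics are fully characterized by the $K$-way probabilities tensor $\mathcal{P}_{\mathbf{u}} \in \mathbb{R}^{P \times P \times \cdots \times P}$, whose elements are
the probabilities $\mathcal{P}_{\mathbf{u}}(m_1, \ldots, m_K) = \Pr\{u_1 = m_1, \ldots, u_K = m_K\}$, $m_1, \ldots, m_K \in \{0, \ldots, P-1\}$.
Using vector-index notations, where $\mathbf{m} = [m_1 \cdots m_K]^T$, we may also express this relation more
compactly as $\mathcal{P}_{\mathbf{u}}(\mathbf{m}) = \Pr\{\mathbf{u} = \mathbf{m}\}$. The characteristic tensor $\tilde{\mathcal{P}}_{\mathbf{u}} \in \mathbb{C}^{P \times P \times \cdots \times P}$ is given by the
$K$-dimensional DFT of $\mathcal{P}_{\mathbf{u}}$, which, using a similar index-vector notation, is given by

$$\tilde{\mathcal{P}}_{\mathbf{u}}(\mathbf{n}) = E[W_P^{\mathbf{n}^T \mathbf{u}}] = \sum_{\mathbf{m}} \mathcal{P}_{\mathbf{u}}(\mathbf{m}) W_P^{\mathbf{n}^T \mathbf{m}}, \tag{9}$$

where the summation extends over all possible $P^K$ index combinations in $\mathbf{m}$.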

III. Problem Formulation and Identifiability

We are now ready to formulate the mixture model over GF(P). Assume that there are $K$ statistically
independent random source signals denoted $\mathbf{s}[t] = [s_1[t]\ s_2[t] \cdots s_K[t]]^T$, each with an iid time-structure,
such that at each time-instant $t$, $s_k[t]$ is an independent realization of a random variable
in GF(P), characterized by the (unknown) distribution vector $\mathbf{p}_k$.

Let these sources be mixed (over GF(P)) by an unknown, square ($K \times K$) mixing matrix $\mathbf{A}$ (with
elements in GF(P)),

$$\mathbf{x}[t] = \mathbf{A} \circ \mathbf{s}[t]. \tag{10}$$

We further assume that $\mathbf{A}$ is invertible over the field, namely that it has a unique inverse over
GF(P), denoted $\mathbf{B} = \mathbf{A}^{-1}$, satisfying $\mathbf{B} \circ \mathbf{A} = \mathbf{A} \circ \mathbf{B} = \mathbf{I}$, where $\mathbf{I}$ denotes the $K \times K$ identity
matrix. Like in "classical" linear algebra (over $\mathbb{R}$), $\mathbf{A}$ is non-singular (invertible) iff its determinant¹
is non-zero. Equivalently, $\mathbf{A}$ is singular iff there exists (in GF(P)) a nonzero vector $\mathbf{u}$, such that
$\mathbf{A} \circ \mathbf{u} = \mathbf{0}$ (an all-zeros vector).

We are interested in the identifiability, possibly up to some tolerable ambiguities, of $\mathbf{A}$ (or,
equivalently, of its inverse $\mathbf{B}$) from the set of observations $\mathbf{x}[t]$, $t = 1, 2, \ldots, T$ under asymptotic
conditions, namely as $T \to \infty$. Due to the assumption of iid samples for each source (implying
ergodicity), the joint statistics of the observations can be fully and consistently estimated from the
available data. Therefore, the assumption of asymptotic conditions implies full and exact knowledge
of the joint probability distribution tensor $\mathcal{P}_{\mathbf{x}}$ of the observation vector $\mathbf{x}$ (we dropped the time-index
$t$ here, due to the stationarity). The remaining question is, therefore, whether, and if so, under what
conditions, $\mathbf{A}$ can be identified (up to tolerable ambiguities) from exact, full knowledge of $\mathcal{P}_{\mathbf{x}}$.

¹ The determinant over GF(P) can be calculated in a similar way to calculating the determinant over $\mathbb{R}$, using the field's
addition/subtraction and multiplication operations.

To answer this question, we first explore some basic statistical properties of linear combinations of
random variables over GF(P). The characteristic vectors are particularly useful for this analysis. Let
$u$ and $v$ denote two statistically independent random variables in GF(P) with probability vectors $\mathbf{p}_u$
and $\mathbf{p}_v$ and characteristic vectors $\tilde{\mathbf{p}}_u$ and $\tilde{\mathbf{p}}_v$, respectively. If $w = u \oplus v$, then the probability vector
$\mathbf{p}_w$ of $w$ is given by the cyclic convolution between $\mathbf{p}_u$ and $\mathbf{p}_v$, and the characteristic vector $\tilde{\mathbf{p}}_w$ is
therefore given by the element-wise product of $\tilde{\mathbf{p}}_u$ and $\tilde{\mathbf{p}}_v$:

$$p_w(n) = \sum_{m=0}^{P-1} \Pr\{u = m, v = n \ominus m\} = \sum_{m=0}^{P-1} p_u(m) p_v(n \ominus m) \;\Leftrightarrow\; \tilde{p}_w(n) = \tilde{p}_u(n) \tilde{p}_v(n)\ \ \forall n. \tag{11}$$

Two intuitively appealing (nearly trivial) properties follow from this relation. First, combined with
Property P5 (in Section II), this relation implies that the sum (over GF(P)) of two independent
random variables is a degenerate random variable iff both are degenerate. Likewise, combined with
Property P3, this relation implies that the sum is uniform if at least one of the variables is uniform.
The converse, however, is perhaps somewhat less trivial, since it involves a distinction between GF(2)
and GF(3) on one hand, and GF(P) with P > 3 on the other hand, as suggested by the following
lemma:

Lemma 1: Let $u$ and $v$ be two statistically independent random variables in GF(P), and let $w =
u \oplus v$. If both $u$ and $v$ are non-uniform, then:

1) If P = 2 or P = 3, $w$ is also non-uniform;

2) If P > 3, $w$ may or may not be uniform.

Proof: By Property P3, $w$ would be uniform iff for each $n \neq 0$, either $\tilde{p}_u(n) = 0$ or $\tilde{p}_v(n) = 0$
(or both). In GF(2) this can only happen if either $\theta_u = 0$ or $\theta_v = 0$ (or both), which implies that at
least one of the two variables is uniform. Likewise, in GF(3) this can only happen if either $\xi_u = 0$
or $\xi_v = 0$ (or both), leading to a similar conclusion.

However, for P > 3 there are sufficiently many degrees of freedom in the characteristic vectors of
$u$ and $v$ to allow both non-zero and zero elements in both $\tilde{\mathbf{p}}_u$ and $\tilde{\mathbf{p}}_v$, as long as at each $n \neq 0$ either
one is zero. For example, consider P = 5 with $\tilde{\mathbf{p}}_u = [1\ 0\ 0.3\ 0.3\ 0]^T$ and $\tilde{\mathbf{p}}_v = [1\ 0.4\ 0\ 0\ 0.4]^T$. This
corresponds to $\mathbf{p}_u \approx [0.32\ 0.10\ 0.24\ 0.24\ 0.10]^T$ and $\mathbf{p}_v \approx [0.36\ 0.25\ 0.07\ 0.07\ 0.25]^T$, which are
clearly non-uniform. However, if these $u$ and $v$ are independent, their sum (over GF(5)) is a uniform
random variable. ■
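The GF(5) example in the proof can be reproduced numerically. The sketch below assumes the characteristic vectors $[1\ 0\ 0.3\ 0.3\ 0]^T$ and $[1\ 0.4\ 0\ 0\ 0.4]^T$ (the positions of the zeros are inferred from the stated probability vectors), recovers the probability vectors by an inverse DFT, and forms the distribution of the sum by the cyclic convolution in (11). The function names are hypothetical.

```python
import cmath


def inverse_char_vector(pt):
    """Inverse DFT: probability vector from a characteristic vector."""
    P = len(pt)
    W = cmath.exp(-2j * cmath.pi / P)
    return [(sum(pt[n] * W ** (-(m * n)) for n in range(P)) / P).real
            for m in range(P)]


def cyclic_convolve(p_u, p_v):
    """Probability vector of u + v (mod P) for independent u and v, as in (11)."""
    P = len(p_u)
    return [sum(p_u[m] * p_v[(n - m) % P] for m in range(P)) for n in range(P)]


pt_u = [1, 0, 0.3, 0.3, 0]
pt_v = [1, 0.4, 0, 0, 0.4]
p_u = inverse_char_vector(pt_u)   # non-uniform
p_v = inverse_char_vector(pt_v)   # non-uniform
p_w = cyclic_convolve(p_u, p_v)   # distribution of w = u + v over GF(5)

print([round(q, 2) for q in p_u])
print([round(q, 2) for q in p_v])
print([round(q, 2) for q in p_w])
```

Because the two characteristic vectors have no common nonzero entry at any n ≠ 0, their element-wise product vanishes there, and the sum comes out uniform even though neither summand is.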

Note, in addition, that since multiplication by a nonzero constant in GF(P) is bijective, uniform or degenerate
random variables cannot become non-uniform or non-degenerate (nor vice-versa) by multiplication
with such a constant. Consequently, the above conclusions and Lemma 1 hold not only for the sum of two
random variables, but also for any linear combination (over GF(P)) thereof.



We now add the following Lemma: 

Lemma 2: Let $u$ and $v$ be two statistically independent, non-degenerate random variables in GF(P),
and let $w = u \oplus v$. Then $v$ and $w$ are statistically independent iff $u$ is uniform.
Proof: The joint probability distribution of $v$ and $w$ is given by

$$P_{v,w}(m,n) = \Pr\{v = m, w = n\} = \Pr\{v = m, u = n \ominus m\} = p_v(m) p_u(n \ominus m). \tag{12}$$

Now, $w$ and $v$ are independent iff this probability equals $p_v(m) p_w(n)$ for all $m, n$, namely iff $p_u(n \ominus
m) = p_w(n)$ for all $n$ and for all $m$ for which $p_v(m) \neq 0$. Since $v$ is non-degenerate, there are at
least two such values of $m$. Denoting these values as $m_1$ and $m_2$, this condition translates into

$$p_u(n \ominus m_1) = p_u(n \ominus m_2) = p_w(n)\ \ \forall n. \tag{13}$$

We therefore also have $p_u(n) = p_u(n \oplus m_1 \ominus m_2)\ \forall n$, which can be recursively generalized into

$$p_u(n) = p_u\big(n \oplus k \otimes (m_1 \ominus m_2)\big)\ \ \forall n,\ k \in \mathrm{GF}(P). \tag{14}$$

Since P is prime, each element in GF(P) can be represented (given $n$, $m_1$ and $m_2$) as $n \oplus k \otimes (m_1 \ominus
m_2)$ with some $k$, therefore this condition is satisfied iff $p_u(n)$ is constant, namely iff $u$ is uniform.

■

To establish our identifiability condition we need one additional lemma, which characterizes the 
entropy of a linear combination of random variables in GF(P). 

Lemma 3: Let $u$ and $v$ be two statistically independent, non-degenerate random variables in GF(P),
and let $w = u \oplus v$. Then $H(w) \geq H(u)$, where equality holds iff $u$ is uniform.

Proof: As already mentioned in Section II, $H(w,v) \leq H(w) + H(v)$, with equality iff $w$ and $v$ are
statistically independent. In addition, $H(w|v) = H(w,v) - H(v)$. Therefore, $H(w|v) \leq H(w)$, with
equality iff $w$ and $v$ are statistically independent. Next, from (12) we have $\Pr\{w = n | v = m\} = p_u(n \ominus m)$,
and therefore, as could be intuitively expected,

$$H(w|v) = -\sum_{m=0}^{P-1} p_v(m) \sum_{n=0}^{P-1} p_u(n \ominus m) \log p_u(n \ominus m) = \sum_{m=0}^{P-1} p_v(m) H(u) = H(u), \tag{15}$$

and we therefore conclude that $H(u) \leq H(w)$, with equality iff $w$ and $v$ are statistically independent.
Now, according to Lemma 2, $w$ and $v$ are statistically independent iff $u$ is uniform, which completes
the proof. ■
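Lemmas 2 and 3 can be checked numerically on a small GF(3) example. In this sketch the function names and the specific probability vectors are assumptions chosen for illustration; independence of v and w is tested through the factorization of (12), and entropies use the base-P logarithm.

```python
import math


def entropy(p):
    """Entropy in base-P units."""
    P = len(p)
    return -sum(q * math.log(q, P) for q in p if q > 0)


def sum_distribution(p_u, p_v):
    """Probability vector of w = u + v (mod P) for independent u and v."""
    P = len(p_u)
    return [sum(p_u[m] * p_v[(n - m) % P] for m in range(P)) for n in range(P)]


def v_and_w_independent(p_u, p_v, tol=1e-12):
    """Check whether the joint (12) of (v, w) factorizes as p_v(m) p_w(n)."""
    P = len(p_u)
    p_w = sum_distribution(p_u, p_v)
    return all(abs(p_v[m] * p_u[(n - m) % P] - p_v[m] * p_w[n]) < tol
               for m in range(P) for n in range(P))


p_v = [0.6, 0.3, 0.1]
u_non_uniform = [0.5, 0.25, 0.25]
u_uniform = [1 / 3, 1 / 3, 1 / 3]

# Lemma 2: v and w are independent only when u is uniform.
indep_when_non_uniform = v_and_w_independent(u_non_uniform, p_v)
indep_when_uniform = v_and_w_independent(u_uniform, p_v)

# Lemma 3: the sum has larger entropy than either non-uniform component.
h_w = entropy(sum_distribution(u_non_uniform, p_v))
h_u, h_v = entropy(u_non_uniform), entropy(p_v)
```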

Obviously, a similar result (namely $H(w) \geq H(v)$) can be obtained by switching roles between $u$
and $v$ in the proof. Note an essential difference from a similar result over $\mathbb{R}$: In $\mathbb{R}$ the entropy (or
differential entropy) of a sum of two independent, non-degenerate random variables is always strictly
larger than their individual entropies, no matter how they are distributed. In GF(P), however, equality
is attained if one of the variables is uniform. In fact, this equality is inevitable, simply because the
entropy of any random variable in GF(P) is upper-bounded by the uniform variable's entropy (of
$\log P$).

We are now ready to state our identifiability condition: 

Theorem 1: Let $\mathbf{s}$ be a $K \times 1$ random vector whose elements are statistically independent, non-degenerate
random variables in GF(P). Let $\mathbf{A}$ be a $K \times K$ non-singular matrix in GF(P), and let the
random vector $\mathbf{x}$ be defined as $\mathbf{x} = \mathbf{A} \circ \mathbf{s}$. Assume that the probability distribution of $\mathbf{x}$ is fully known
(specified by the probabilities tensor $\mathcal{P}_{\mathbf{x}}$). Then $\mathbf{A}$ can be identified, up to possible permutation and
scaling of its columns, from $\mathcal{P}_{\mathbf{x}}$ alone, iff none of the elements of $\mathbf{s}$ is a uniform random variable.
Proof: The necessity of this condition is obvious by Lemma 2. Even in the simplest $2 \times 2$ case,
if one of the sources, say $s_1$, is uniform, then by Lemma 2 any linear combination of $s_1$ with the
other source $s_2$ is still statistically independent of $s_2$. Therefore, if the mixed signals are $x_1 = s_1 \oplus s_2$
and $x_2 = s_2$, then $x_1$ and $x_2$ are statistically independent, so this situation is indistinguishable from
a non-mixing observation of two independent sources with the same marginal distributions as $x_1$ and
$x_2$ (which are also the marginal distributions of $s_1$ and $s_2$ (resp.) in this case).

To observe the sufficiency of the condition, note first that since $\mathbf{A}$ is invertible over GF(P),
any invertible linear mixture of the original sources $\mathbf{s}$ can be obtained by applying some invertible
linear mixing to the observations $\mathbf{x}$. Therefore, by applying all (finite number of) invertible linear
transformations to $\mathbf{x}$, one can implicitly obtain all the invertible linear transformations of $\mathbf{s}$. Indeed,
let $\mathbf{B}$ denote an arbitrary invertible matrix in GF(P), and denote

$$\mathbf{y} \triangleq \mathbf{B} \circ \mathbf{x} = (\mathbf{B} \circ \mathbf{A}) \circ \mathbf{s}. \tag{16}$$

Since both $\mathbf{B}$ and $\mathbf{A}$ are non-singular, so is $\mathbf{B} \circ \mathbf{A}$, which therefore:

1) Has at least one non-zero element in each row; and

2) Has at least one non-zero element in each column, which means that each element of $\mathbf{s}$ is a
component of (namely, participates with nonzero weight in) at least one element of $\mathbf{y}$.

Now define the respective sums of (marginal) entropies, $H_{mar}(\mathbf{y}) = \sum_{k=1}^{K} H(y_k)$ and $H_{mar}(\mathbf{s}) =
\sum_{k=1}^{K} H(s_k)$. Consequently, by Lemma 3, $H_{mar}(\mathbf{y})$ cannot be made smaller than $H_{mar}(\mathbf{s})$. Moreover,
if none of the elements of $\mathbf{s}$ is uniform, then

$$H_{mar}(\mathbf{y}) = H_{mar}(\mathbf{s}) \;\Leftrightarrow\; \mathbf{B} \circ \mathbf{A} = \boldsymbol{\Pi} \circ \boldsymbol{\Lambda}, \tag{17}$$

where $\boldsymbol{\Pi}$ denotes a $K \times K$ permutation matrix and $\boldsymbol{\Lambda}$ denotes a $K \times K$ diagonal, nonsingular matrix
in GF(P). Any other form of $\mathbf{B} \circ \mathbf{A}$ would imply that at least one of the elements of $\mathbf{y}$ is a linear
combination of at least two elements of $\mathbf{s}$, and as such has higher entropy than both, and since at
least one of these two elements is also present in at least one other element of $\mathbf{y}$, $H_{mar}(\mathbf{y})$ must be
larger than $H_{mar}(\mathbf{s})$.

It is therefore possible, at least conceptually, to apply each $K \times K$ nonsingular matrix $\mathbf{B}$ in GF(P)
to $\mathbf{x}$, and select one of the minimizers of $H_{mar}(\mathbf{y})$. The inverse of this minimizer is guaranteed to
be equivalent to $\mathbf{A}$ up to permutation and scaling,

$$\mathbf{B} \circ \mathbf{A} = \boldsymbol{\Pi} \circ \boldsymbol{\Lambda} \;\Rightarrow\; \mathbf{B}^{-1} = \mathbf{A} \circ \boldsymbol{\Lambda}^{-1} \circ \boldsymbol{\Pi}^T \tag{18}$$

(where all the inverses are taken over GF(P)). ■

Note that in GF(2) the scaling ambiguity is meaningless, because the only possible scalar multi- 
plication is by 1, therefore only the permutation ambiguity remains. In GF(3) the possible scaling 
ambiguity entails multiplication by either 1 or 2, or, if the "offset group" {0, 1, -1} is used, this
ambiguity merely translates into a sign-ambiguity. 

Although the number of $K \times K$ nonsingular matrices in GF(P) is finite, this number is of the
order of $P^{K^2}$, which clearly becomes prohibitively large even with relatively small values of P
and K. Therefore, our identifiability proof, which is based on an exhaustive search, can hardly be
translated into a practical separation scheme. Nevertheless, in Section V below we shall propose
and discuss two practical separation approaches, which require a significantly reduced computational
effort. First, however, we need to address one more theoretical aspect of our model, which is whether
(and if so under what conditions) pairwise independence of linear mixtures implies their full mutual
independence.
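To give a sense of the size of the exhaustive search mentioned above, the exact number of invertible $K \times K$ matrices over GF(P) can be computed with the standard counting formula $\prod_{i=0}^{K-1}(P^K - P^i)$ (each new row must lie outside the span of the previous ones). This formula is a well-known fact that is not stated in the text, and the function name `count_invertible` is hypothetical.

```python
def count_invertible(K, P):
    """Number of invertible K x K matrices over GF(P), P prime."""
    count = 1
    for i in range(K):
        # Row i can be any of the P**K vectors outside the span
        # (of size P**i) of the i rows already chosen.
        count *= P ** K - P ** i
    return count


# A few small cases, compared against the P**(K*K) order of magnitude.
for K, P in [(2, 2), (3, 2), (2, 3), (4, 3), (5, 5)]:
    print(K, P, count_invertible(K, P), P ** (K * K))
```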

IV. Pairwise independence implying full independence 

One of the basic, key concepts in ICA over $\mathbb{R}$ is the Darmois-Skitovich Theorem (e.g., [6], p. 218),
which is used, either explicitly or implicitly, in many ICA methods ([2]). This theorem states that if two
linear combinations (over $\mathbb{R}$) of statistically independent random variables are statistically independent,
then all the random variables which participate (with non-zero coefficients) in both combinations must
be Gaussian. Consequently (see, e.g., [2]), under the classical identifiability condition (for ICA over
$\mathbb{R}$) of not more than one Gaussian source, pairwise statistical independence of linear mixtures of the
sources always implies their full mutual statistical independence (namely, a non-mixing condition).

As we shall show in this section, this property does not carry over to our GF(P) scenario by mere
substitution of the Gaussian distribution with the uniform. As it turns out, under our identifiability
condition (for ICA over GF(P)) of no uniform sources, pairwise independence implies full independence
in GF(2) and in GF(3), but not in GF(P) with P > 3. The reason for this distinction
is the distinction made in Lemma 1 above, regarding the possibility that a linear combination of
non-uniform, independent random variables be uniform in GF(P) with P > 3 (but not in GF(2) or
in GF(3)).

Indeed, consider three independent random variables $s_1$, $s_2$ and $s_3$ in GF(5), with probability
vectors $\mathbf{p}_1 = \mathbf{p}_2$ and $\mathbf{p}_3$ (resp.) following the example given in the proof of Lemma 1. Namely, let
the respective characteristic vectors be given by $\tilde{\mathbf{p}}_1 = \tilde{\mathbf{p}}_2 = [1\ 0\ 0.3\ 0.3\ 0]^T$ and $\tilde{\mathbf{p}}_3 = [1\ 0.4\ 0\ 0\ 0.4]^T$.
This implies $\mathbf{p}_1 = \mathbf{p}_2 \approx [0.32\ 0.10\ 0.24\ 0.24\ 0.10]^T$ and $\mathbf{p}_3 \approx [0.36\ 0.25\ 0.07\ 0.07\ 0.25]^T$. Clearly,
our identifiability condition is satisfied here, since none of these random variables is uniform. However,
$s_1 \oplus s_3$, as well as $s_2 \oplus s_3$, are uniform. Thus, consider the mixing matrix

$$\mathbf{A} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 1 & 1 & 1 \end{bmatrix},$$

which yields

$$\begin{aligned} x_1 &= s_1 \\ x_2 &= s_2 \\ x_3 &= s_1 \oplus s_2 \oplus s_3. \end{aligned} \tag{19}$$

Now, $x_1$ and $x_2$ are obviously statistically independent. Moreover, since $s_2 \oplus s_3$ is uniform and
independent of $s_1$, we deduce, by Lemma 2, that $x_3$ and $x_1$ are also statistically independent. Similarly,
by switching roles between $s_1$ and $s_2$, we further deduce that $x_3$ and $x_2$ are statistically independent
as well. Therefore, $x_1$, $x_2$ and $x_3$ are pairwise independent, but are clearly not fully mutually
independent.
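This counter-example can be verified numerically by building the exact joint distribution of the three mixtures in (19) and comparing each marginal with the product of its one-dimensional marginals. The sketch assumes characteristic vectors $[1\ 0\ 0.3\ 0.3\ 0]^T$ and $[1\ 0.4\ 0\ 0\ 0.4]^T$ for the sources; the joint distribution is stored as a dictionary keyed by index tuples, and all function names are hypothetical.

```python
import cmath
from itertools import product

P = 5


def inverse_char_vector(pt):
    """Inverse DFT: probability vector from a characteristic vector."""
    W = cmath.exp(-2j * cmath.pi / len(pt))
    return [(sum(pt[n] * W ** (-(m * n)) for n in range(len(pt))) / len(pt)).real
            for m in range(len(pt))]


p1 = p2 = inverse_char_vector([1, 0, 0.3, 0.3, 0])
p3 = inverse_char_vector([1, 0.4, 0, 0, 0.4])

# Joint distribution of x1 = s1, x2 = s2, x3 = s1 + s2 + s3 (mod 5):
# given x1 = a and x2 = b, x3 = c requires s3 = c - a - b (mod 5).
joint = {(a, b, c): p1[a] * p2[b] * p3[(c - a - b) % P]
         for a, b, c in product(range(P), repeat=3)}


def marginal(joint, keep):
    """Marginal distribution over the coordinates listed in `keep`."""
    out = {}
    for idx, q in joint.items():
        key = tuple(idx[k] for k in keep)
        out[key] = out.get(key, 0.0) + q
    return out


def factorization_gap(joint, keep):
    """Largest deviation of a marginal from the product of its 1-D marginals."""
    sub = marginal(joint, keep)
    singles = [marginal(joint, (k,)) for k in keep]
    gap = 0.0
    for idx, q in sub.items():
        prod = 1.0
        for single, i in zip(singles, idx):
            prod *= single[(i,)]
        gap = max(gap, abs(q - prod))
    return gap


pairwise_gaps = [factorization_gap(joint, pair) for pair in [(0, 1), (0, 2), (1, 2)]]
full_gap = factorization_gap(joint, (0, 1, 2))
print(pairwise_gaps, full_gap)
```

The pairwise gaps vanish up to rounding, while the gap for the full three-way joint does not, which is exactly the pairwise-but-not-mutual independence described above.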

Obviously, such a counter-example cannot be constructed in GF(2) or in GF(3), since in these 
fields a linear combination of non-uniform, statistically independent random variables cannot be 
uniform. Furthermore, our following theorem asserts that, under our identifiability conditions, pairwise
statistical independence of the mixtures indeed implies their full statistical independence in GF(2) 
and in GF(3). 

Theorem 2: Let $\mathbf{s}$ be a $K \times 1$ random vector whose elements are statistically independent, non-degenerate
and non-uniform random variables in GF(2) or in GF(3). Let $\mathbf{y} = \mathbf{D} \circ \mathbf{s}$ denote a $K \times 1$
vector of non-trivial linear combinations of the elements of $\mathbf{s}$ over the field, prescribed by the elements
of the $K \times K$ matrix $\mathbf{D}$.

If the elements of $\mathbf{y}$ are all pairwise statistically independent (namely, if $y_k$ is statistically independent
of $y_\ell$ for all $k \neq \ell$, $k, \ell \in \{1, \ldots, K\}$), then $\mathbf{D} = \boldsymbol{\Pi} \circ \boldsymbol{\Lambda}$, where $\boldsymbol{\Pi}$ is a $K \times K$ permutation
matrix and $\boldsymbol{\Lambda}$ is a $K \times K$ non-singular diagonal matrix in the field. In other words, the elements of $\mathbf{y}$
are merely a permutation of the (possibly scaled) elements of $\mathbf{s}$, and are therefore not only pairwise,
but also fully statistically independent.

Obviously, in GF(2) $\boldsymbol{\Lambda}$ must be $\mathbf{I}$ (no scaling ambiguity), and in GF(3) (assuming the group
{0, 1, -1}), $\boldsymbol{\Lambda}$ has only $\pm 1$-s along its diagonal (the scaling ambiguity is just a sign ambiguity). A
proof for each of the two cases, GF(2) and GF(3), is provided in Appendix A. We now proceed to
propose practical separation approaches.

V. Practical Separation Approaches 

In this section we propose two possible practical separation approaches, based on the properties
developed above.

Note that any approach which exploits the full statistical description of the joint probability distribution
of $\mathbf{x}$ would require collection (estimation) and some manipulation of the probabilities tensor
$\mathcal{P}_{\mathbf{x}}$, which is $P^K$ large, and, therefore, a computational load of at least $O(P^K)$ seems inevitable.
Still, this is significantly smaller (and often realistically far more affordable) than $O(K^2 P^{K^2})$ (as
required by brute-force search for the unmixing matrix), even for relatively small values of P and
K.

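Note further that in order to obtain reasonable estimates of $\mathcal{P}_{\mathbf{x}}$ in practice, the number of available
observation vectors $T$ has to be significantly larger than $P^K$ (the size of $\mathcal{P}_{\mathbf{x}}$). The estimation of $\mathcal{P}_{\mathbf{x}}$
can be obtained by the following simple collection process:

1) Initialize $\mathcal{P}_{\mathbf{x}}$ as an all-zeros tensor;

2) For $t = 1, 2, \ldots, T$, set $\mathcal{P}_{\mathbf{x}}(\mathbf{x}[t]) \leftarrow \mathcal{P}_{\mathbf{x}}(\mathbf{x}[t]) + 1$;

3) Set $\mathcal{P}_{\mathbf{x}} \leftarrow \frac{1}{T} \cdot \mathcal{P}_{\mathbf{x}}$.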
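The three collection steps above amount to a normalized histogram of the observation vectors. A minimal sketch follows; as an assumption made here for compactness, the tensor is stored sparsely as a dictionary keyed by index tuples (absent keys meaning zero probability) rather than as a dense $P^K$ array, and the function name `estimate_tensor` is hypothetical.

```python
from collections import Counter


def estimate_tensor(observations):
    """Empirical probabilities tensor from a list of K-element observation vectors."""
    T = len(observations)
    # Steps 1 and 2: start from zero counts and add 1 at each observed index vector.
    counts = Counter(tuple(x) for x in observations)
    # Step 3: normalize by the number of observations.
    return {idx: c / T for idx, c in counts.items()}


# Four observations of a 2-element mixture in GF(2).
obs = [(0, 1), (0, 1), (1, 1), (1, 0)]
P_x = estimate_tensor(obs)
```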

Fortunately, however, a single collection of the observations' statistics for obtaining $\hat{\mathcal{P}}_x$ is generally sufficient, since, in order to obtain the empirical statistical characterization $\hat{\mathcal{P}}_y$ of any linear transformation $y = G \circ x$ of the observations (where G is an arbitrary L × K matrix with elements in GF(P)), it is not necessary to actually apply the transformation to the T available observation vectors and then recollect the probabilities. The same result can be obtained directly (without re-involving the observations), simply by applying a similar accumulation procedure to the K-way tensor $\hat{\mathcal{P}}_x$ in constructing the L-way tensor $\hat{\mathcal{P}}_y$:

1) Initialize $\hat{\mathcal{P}}_y$ as an all-zeros tensor;

2) Running over all $P^K$ index-vectors i (from $[0 \cdots 0]^T$ to $[P-1 \cdots P-1]^T$), set

$$\hat{\mathcal{P}}_y(G \circ i) \leftarrow \hat{\mathcal{P}}_y(G \circ i) + \hat{\mathcal{P}}_x(i). \tag{20}$$

Note that when G is a square invertible matrix, $\hat{\mathcal{P}}_y$ is simply a permutation of $\hat{\mathcal{P}}_x$.
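A minimal sketch of the accumulation rule (20), assuming the tensor is stored as a dictionary from index tuples to probabilities (missing cells are zero). The name `transform_tensor` and the example values are hypothetical.

```python
from itertools import product

def transform_tensor(P_x, G, P):
    """Probabilities tensor of y = G o x over GF(P), built from that of x via (20)."""
    K = len(G[0])
    P_y = {}
    for i in product(range(P), repeat=K):      # all P^K index vectors
        mass = P_x.get(i, 0.0)
        if mass == 0.0:
            continue                           # empty cells contribute nothing
        y = tuple(sum(g * v for g, v in zip(row, i)) % P for row in G)
        P_y[y] = P_y.get(y, 0.0) + mass
    return P_y

# Toy tensor over GF(3) with K = 2.
P_x = {(0, 1): 0.5, (1, 1): 0.25, (2, 0): 0.25}
P_y_row = transform_tensor(P_x, [[1, 2]], P=3)          # L = 1: y = x1 + 2*x2
P_y_sq = transform_tensor(P_x, [[1, 1], [0, 1]], P=3)   # square invertible G
```

In the second call G is invertible, so the output holds the same probability values at permuted positions, in line with the remark above.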

A. Ascending Minimization of EntRopies for ICA (AMERICA) 

Our first approach is based on minimizing the individual entropies of the recovered sources. Conceptually, such an approach can consist of going over all possible $P^K - 1$ nontrivial linear combinations of the observations, and computing their respective entropies. Then, given these entropies, we need to select the K linear combinations with the smallest entropies, such that their respective linear-combination coefficients vectors (rows of the implied unmixing matrix) are linearly independent (in GF(P)).

Let us first consider the computation of the entropies of all possible (nontrivial) $P^K - 1$ linear combinations prescribed by the coefficients vectors $i_n$ (for $n = 1, \ldots, P^K - 1$). Each requires the computation of the respective probabilities vector $p_{y_n}$ of $y_n = i_n^T \circ x$, by applying the above-mentioned tensor-accumulation procedure with $G = i_n^T$ to the tensor $\mathcal{P}_x$. Thus, the number of required multiplications is roughly $O(K \cdot (P^K)^2) = O(K \cdot P^{2K})$, which (for K > 2) is much smaller than $O(K^2 \cdot P^{K^2})$ (the brute-force search cost), but may still be quite large. Fortunately, it is possible to compute the required probabilities vectors more conveniently, via the estimated characteristic tensor $\tilde{\mathcal{P}}_x$, which can be obtained using a multidimensional Fast Fourier Transform (FFT).

The proposed computation proceeds as follows. First, given the estimated probabilities tensor $\hat{\mathcal{P}}_x$, we obtain the estimated characteristic tensor $\tilde{\mathcal{P}}_x$ using a K-dimensional FFT, by successively applying 1-dimensional radix-P DFTs along each of the K dimensions. Thus, for each dimension we compute $P^{K-1}$ P-long DFTs, at the cost of $O(P^{K-1} \cdot (P \log P)) = O(P^K \log P)$. The total cost for obtaining $\tilde{\mathcal{P}}_x$ is therefore $O(K \cdot P^K \log P) = O(P^K \log(P^K))$, rather than $O((P^K)^2)$, as would be required by direct calculation.
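The separable structure of the K-dimensional transform can be illustrated as follows. This sketch applies a plain P-point DFT (not a fast radix-P algorithm) along each dimension in turn, and it assumes the kernel $W_P = e^{-j2\pi/P}$; the sign convention and the helper names are assumptions made for the illustration.

```python
import cmath
from itertools import product

def dft_along_axis(tensor, axis, P):
    """Apply a P-point DFT along one axis of a dense dict-based K-way tensor."""
    W = cmath.exp(-2j * cmath.pi / P)   # assumed sign convention for W_P
    out = {}
    for idx in tensor:
        acc = 0j
        for m in range(P):
            src = idx[:axis] + (m,) + idx[axis + 1:]
            acc += tensor[src] * W ** (m * idx[axis])
        out[idx] = acc
    return out

def characteristic_tensor(P_x, P, K):
    """K-dimensional DFT of the probabilities tensor, one dimension at a time."""
    dense = {idx: complex(P_x.get(idx, 0.0)) for idx in product(range(P), repeat=K)}
    for axis in range(K):
        dense = dft_along_axis(dense, axis, P)
    return dense

P_x = {(0, 1): 0.5, (1, 1): 0.25, (2, 0): 0.25}   # toy tensor, P = 3, K = 2
char_x = characteristic_tensor(P_x, P=3, K=2)
```

Replacing the inner loop by a true FFT would give the $O(P^K \log(P^K))$ cost quoted above; the result is the same.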

Now, in order to obtain the characteristic vector $\tilde{p}_{y_n}$ of $y_n = i_n^T \circ x$, we can exploit the following relation:

$$\tilde{p}_{y_n}(m) = E[W_P^{m y_n}] = E[W_P^{m\, i_n^T x}] = \tilde{\mathcal{P}}_x(m \otimes i_n), \quad m = 0, \ldots, P-1, \tag{21}$$

which means that for each $i_n$, each (m-th) element of the characteristic vector of $y_n$ can be extracted from the respective element ($m \otimes i_n$) of $\tilde{\mathcal{P}}_x$. Note further that the first (m = 0) element of each characteristic vector is 1, and that the conjugate-symmetry of the characteristic vectors can be exploited, such that only the "first half" ($m = 1, \ldots, \lfloor P/2 \rfloor$) needs to be extracted from $\tilde{\mathcal{P}}_x$. Naturally, in the absence of the true $\tilde{\mathcal{P}}_x$, we would use the empirical characteristic tensor, obtained from the empirical probabilities tensor $\hat{\mathcal{P}}_x$, as described above.

The extraction of the characteristic vectors $\tilde{p}_{y_n}$ for all $i_n$ requires $O(P^K \cdot PK)$ additional operations. Once these vectors are obtained, they are each converted, using inverse FFT, into probabilities vectors $p_{y_n}$, from which the entropies are readily obtained. This requires additional $O(P^K \cdot (P \log P + P))$ operations (excluding the computation of $P \cdot P^K$ logarithms).
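The extraction step (21), followed by the inverse transform and the entropy computation, might look as follows. For self-containment the characteristic tensor is computed here directly from its definition rather than by FFT, the kernel $W_P = e^{-j2\pi/P}$ is an assumed convention, conjugate symmetry is not exploited, and all names are hypothetical.

```python
import cmath
import math
from itertools import product

def char_tensor(P_x, P, K):
    """Characteristic tensor E[W^(n^T x)] computed directly from its definition."""
    W = cmath.exp(-2j * cmath.pi / P)
    return {n: sum(p * W ** sum(a * b for a, b in zip(n, x)) for x, p in P_x.items())
            for n in product(range(P), repeat=K)}

def combo_distribution(char, i_n, P):
    """Probabilities vector of y = i_n^T o x: pick elements per (21), then invert."""
    W = cmath.exp(-2j * cmath.pi / P)
    p_tilde = [char[tuple((m * v) % P for v in i_n)] for m in range(P)]
    # Inverse P-point DFT; tiny negative round-off values are clipped to zero.
    return [max((sum(p_tilde[m] * W ** (-m * l) for m in range(P)) / P).real, 0.0)
            for l in range(P)]

def entropy(p):
    return -sum(q * math.log(q) for q in p if q > 0)

P_x = {(0, 1): 0.5, (1, 1): 0.25, (2, 0): 0.25}   # toy tensor, P = 3, K = 2
char_x = char_tensor(P_x, P=3, K=2)
p_y = combo_distribution(char_x, (1, 2), P=3)      # y = x1 + 2*x2 over GF(3)
h_y = entropy(p_y)
```

For this toy tensor, y takes the value 2 with probability 0.75 and the value 0 with probability 0.25, which the transform-domain route reproduces up to round-off.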

Given the entropies of all possible linear combinations (ignoring the trivial $i_0 = 0$), the one with the smallest entropy corresponds to the first extracted source. Once the smallest-entropy source is identified, a "natural" choice is to proceed to the linear combination yielding the second-smallest entropy (and so forth), but special care has to be taken, so that each selected coefficients vector is not linearly dependent (in GF(P)) on the previous ones. One possible way to assure this,
is to take a "deflation" approach (also sometimes taken in classical ICA - see, e.g., ||9l or [H), in which 
each extracted source is first eliminated from the mixture, and then the lowest-entropy combination 
of the remaining ("deflated") mixtures is taken as the "next" extracted source. However, such an 
approach requires finding the coefficients needed for elimination of the extracted source from each 
mixture element, as well as recalculation of all the entropies after each deflation stage, which seems computationally expensive. A possible alternative is to use a greedy sequential extraction, such that the k-th chosen coefficients vector is the one associated with the smallest entropy while being linearly independent of the previously selected k − 1 coefficients vectors. Checking whether a K × 1 vector $b_k$ is linearly independent of the K × 1 vectors $b_1, b_2, \ldots, b_{k-1}$ amounts to checking whether there exists a nonzero k × 1 vector a, such that $[b_1 \cdots b_k] \circ a = 0$, which can be checked by an exhaustive search among all possible nonzero k × 1 vectors in GF(P). This roughly adds (in the "worst", last stage, with k = K) $O(K^2 \cdot P^K)$ multiplications.
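The exhaustive linear-independence test described above can be sketched as follows (the function name `is_independent` and the example vectors are hypothetical; Gaussian elimination over GF(P) would be a cheaper alternative, but the exhaustive search mirrors the text).

```python
from itertools import product

def is_independent(vectors, P):
    """True if the given equal-length vectors are linearly independent over GF(P)."""
    K = len(vectors[0])
    for a in product(range(P), repeat=len(vectors)):
        if not any(a):
            continue                      # skip the all-zeros coefficient vector
        # Does [b_1 ... b_k] o a vanish in every one of its K entries?
        if all(sum(c * b[r] for c, b in zip(a, vectors)) % P == 0 for r in range(K)):
            return False
    return True

dep_gf3 = is_independent([(1, 2), (2, 1)], P=3)                 # (2,1) = 2*(1,2)
ind_gf3 = is_independent([(1, 0), (1, 1)], P=3)
dep_gf2 = is_independent([(1, 1, 0), (0, 1, 1), (1, 0, 1)], P=2)  # rows sum to 0
```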

The total computational cost is therefore approximately $O(P^K \cdot (K^2 + KP + K \log P + P \log P + P))$. The proposed algorithm, which was given the acronym "AMERICA" (Ascending Minimization of EntRopies for ICA), is summarized in Table 1.
Algorithm 1: AMERICA

Input: $\hat{\mathcal{P}}_x$ - the mixtures' K-way P × P × ··· × P estimated (empirical) probabilities tensor;

Output: $\hat{B}$ - the K × K estimated separation matrix;

Notations: We denote by the K × 1 P-nary vector $i_n$ the n-th index vector (for $n = 0, \ldots, P^K - 1$), such that $n = \sum_{k=1}^{K} i_n(k) P^{k-1}$, where $i_n = [i_n(1) \cdots i_n(K)]^T$; all indices in the description below run from 0.

Algorithm:

1) Compute $\tilde{\mathcal{P}}_x$, the observations' empirical characteristic tensor, by applying a K-dimensional radix-P FFT to $\hat{\mathcal{P}}_x$.

2) For $n = 0, \ldots, P^K - 1$, compute $\hat{h}_n$, the (empirical) entropy of the random variable $y_n = i_n^T \circ x$ as follows:

   a) Obtain the P × 1 empirical characteristic vector of $y_n$, denoted $\tilde{p}_n$, as follows:

      i) Set $\tilde{p}_n(0) := 1$;
      ii) Set $\tilde{p}_n(1) := \tilde{\mathcal{P}}_x(i_n)$;
      iii) If P ≥ 3, set $\tilde{p}_n(P-1) := \tilde{\mathcal{P}}_x^*(i_n)$;
      iv) If P > 3, then for $m = 2, \ldots, (P-1)/2$, set $\tilde{p}_n(m) := \tilde{\mathcal{P}}_x(m \otimes i_n)$ and $\tilde{p}_n(P-m) := \tilde{\mathcal{P}}_x^*(m \otimes i_n)$;

   b) Obtain the P × 1 empirical probabilities vector of $y_n$, denoted $\hat{p}_n$, by applying an inverse FFT to the vector $\tilde{p}_n$;

   c) Obtain $\hat{h}_n = -\sum_{m=0}^{P-1} \hat{p}_n(m) \log \hat{p}_n(m)$;

3) Find the smallest entropy among $\hat{h}_1, \ldots, \hat{h}_{P^K-1}$ and denote the minimizing index $n_1$ (i.e., $\hat{h}_{n_1} = \min_{n \neq 0} \hat{h}_n$);

4) Set $\hat{B} := i_{n_1}^T$ and mark $\hat{h}_{n_1}$ as "used";

5) Repeat for k = 2, ..., K:

   a) Find the smallest among all "unused" entropies; denote the minimizing index $n_k$;

   b) Construct the test-matrix $\bar{B} := [\hat{B}^T \; i_{n_k}]$;

   c) Go over all nonzero length-k index vectors $j_n$ ($n = 1, \ldots, P^k - 1$), checking whether $\bar{B} \circ j_n = 0$ for some n. If such $j_n$ is found, mark $\hat{h}_{n_k}$ as "used" and find the next smallest entropy (i.e., go to step 5a);

   d) Set $\hat{B} := \bar{B}^T$.
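The overall flow of Algorithm 1 can be sketched compactly in Python. This is an illustration under simplifying assumptions, not the procedure exactly as tabulated: each entropy is computed by direct accumulation over the probabilities tensor rather than through the FFT route, ties are broken by enumeration order, and all names and example values are hypothetical.

```python
import math
from itertools import product

def combo_entropy(P_x, i_n, P):
    """Entropy of y = i_n^T o x, accumulated directly from the tensor of x."""
    p = [0.0] * P
    for x, mass in P_x.items():
        p[sum(a * b for a, b in zip(i_n, x)) % P] += mass
    return -sum(q * math.log(q) for q in p if q > 0)

def independent(rows, P):
    """Exhaustive test for linear independence of the rows over GF(P)."""
    for a in product(range(P), repeat=len(rows)):
        if any(a) and all(sum(c * r[j] for c, r in zip(a, rows)) % P == 0
                          for j in range(len(rows[0]))):
            return False
    return True

def america(P_x, P, K):
    """Greedy selection of K independent index vectors with the smallest entropies."""
    candidates = [i for i in product(range(P), repeat=K) if any(i)]
    candidates.sort(key=lambda i: combo_entropy(P_x, i, P))  # ascending entropies
    B = []
    for i_n in candidates:
        if independent(B + [i_n], P):
            B.append(i_n)
            if len(B) == K:
                break
    return B

# Toy example over GF(3): two non-uniform sources, mixed by an invertible A.
P, K = 3, 2
p_s = [[0.7, 0.2, 0.1], [0.6, 0.1, 0.3]]
A = [[1, 1], [1, 2]]
P_x = {}
for s in product(range(P), repeat=K):
    x = tuple(sum(a * v for a, v in zip(row, s)) % P for row in A)
    P_x[x] = P_x.get(x, 0.0) + p_s[0][s[0]] * p_s[1][s[1]]
B = america(P_x, P, K)
contamination = [[sum(B[r][j] * A[j][c] for j in range(K)) % P for c in range(K)]
                 for r in range(K)]
```

Since the true tensor is used here, each row of `contamination` should contain a single nonzero entry, i.e. the sources are recovered up to permutation and scale.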
B. Minimizing Entropies by eXchanging In COuples (MEXICO)

An alternative separation approach, which avoids prior calculation of the entropies of all possible linear combinations, is to try to find the separating transformation by successively minimizing the entropies in couples (going over all couples combinations in each "sweep"). More specifically, let $x_1$ and $x_2$ denote the first two elements of the mixtures vector, and let $P_{1,2}$ denote their P × P joint probability matrix, which can be obtained from the tensor $\mathcal{P}_x$ by summing along all other dimensions:

$$P_{1,2}(m,n) = \sum_{i_3,\ldots,i_K=0}^{P-1} \mathcal{P}_x(m,n,i_3,\ldots,i_K), \quad m,n \in [0,P-1]. \tag{22}$$

Consider a random variable of the form

$$\tilde{x}_1 = x_1 \oplus c \otimes x_2, \tag{23}$$

where $c \in [1, P-1]$ is some constant. Let $p_{\tilde{x}_1}(c)$ denote the probabilities vector of $\tilde{x}_1$. The m-th element of this vector is given (depending on c) by

$$p_{\tilde{x}_1}(m;c) = \Pr\{x_1 \oplus c \otimes x_2 = m\} = \sum_{n=0}^{P-1} \Pr\{x_1 = n,\; c \otimes x_2 = m \ominus n\} = \sum_{n=0}^{P-1} P_{1,2}(n,\; c^{-1} \otimes (m \ominus n)), \tag{24}$$

where $c^{-1}$ denotes the reciprocal of c in GF(P), such that $c \otimes c^{-1} = 1$. The entropy of $\tilde{x}_1$ is then given by

$$H(\tilde{x}_1;c) = -\sum_{m=0}^{P-1} p_{\tilde{x}_1}(m;c) \log p_{\tilde{x}_1}(m;c). \tag{25}$$

Consider the value $c_0$ of c which minimizes $H(\tilde{x}_1;c)$. If the resulting entropy is smaller than the entropy of $x_1$, then substitution of $x_1$ with $\tilde{x}_1 = x_1 \oplus c_0 \otimes x_2$ in x would be an invertible linear transformation which reduces the sum of entropies of the elements of x.

Note, in addition, that following this transformation the mutual information $I(\tilde{x}_1, x_2) = H(\tilde{x}_1) + H(x_2) - H(\tilde{x}_1, x_2)$ will be smaller than $I(x_1, x_2)$, because the joint entropies $H(\tilde{x}_1, x_2)$ and $H(x_1, x_2)$ are the same (since the transformation is invertible). Therefore, this transformation also makes these two elements "more independent".
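The basic couple operation of (24)-(25) can be sketched as follows. The helper name and the toy joint matrix are hypothetical; the reciprocal $c^{-1}$ is computed via Fermat's little theorem, which relies on P being prime.

```python
import math

def pair_update_entropies(P12, P):
    """Entropy of x1 (+) c (x) x2 for each c in 1..P-1, following (24)-(25)."""
    result = {}
    for c in range(1, P):
        c_inv = pow(c, P - 2, P)          # reciprocal of c in GF(P), P prime
        p = [sum(P12[n][(c_inv * (m - n)) % P] for n in range(P)) for m in range(P)]
        result[c] = -sum(q * math.log(q) for q in p if q > 0)
    return result

# Toy joint matrix over GF(3): x1 = s1 (+) s2 and x2 = s2, independent s1, s2.
P = 3
p_s1, p_s2 = [0.7, 0.2, 0.1], [0.6, 0.1, 0.3]
P12 = [[0.0] * P for _ in range(P)]
for a in range(P):
    for b in range(P):
        P12[(a + b) % P][b] += p_s1[a] * p_s2[b]

entropies = pair_update_entropies(P12, P)
c0 = min(entropies, key=entropies.get)
h_s1 = -sum(q * math.log(q) for q in p_s1)
```

In this example c = 2 (that is, −1 in GF(3)) removes $s_2$ from $x_1$, so the minimal entropy coincides with that of $s_1$.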

Thus, based on this basic operation, a separation approach can be taken as follows. Let y denote the random vector of "demixed" sources to be constructed by successive linear transformations of x, and initialize y = x, along with its probabilities tensor $\mathcal{P}_y = \mathcal{P}_x$. Proceed sequentially through all couples $y_k, y_\ell$ in y: For each couple, compute the joint probabilities matrix $P_{k,\ell}$, and then look for the value of c which minimizes the entropy of $\tilde{y}_k = y_k \oplus c \otimes y_\ell$. If this entropy is smaller than that of $y_k$, replace $y_k$ with $\tilde{y}_k$, recording the implied linear transformation as $\tilde{y} = V(k,\ell;c) \circ y$, where

$$V(k,\ell;c) = I + c \cdot E_{k,\ell}, \tag{26}$$

$E_{k,\ell}$ denoting a K × K all-zeros matrix with a 1 at the (k,ℓ)-th position. If the minimal entropy of $\tilde{y}_k$ is not smaller than that of $y_k$, no update takes place, and the next couple is addressed.

Upon an update, $\tilde{y}$ serves as the new y. The probabilities tensor $\mathcal{P}_y$ is updated accordingly (this update is merely a permutation, attainable using (20) with $G = V(k,\ell;c)$). The procedure is repeated for each indices-couple (k, ℓ) (with k ≠ ℓ), and we define a "sweep" as a sequential pass over all possible K(K − 1) combinations (note that there is no symmetry here, namely, the couple (ℓ,k) is essentially different from (k,ℓ)). Sweeps are repeated sequentially, until a full sweep without a single update occurs, which terminates the process.

In practice, the algorithm is applied starting with the empirical observations' probabilities tensor $\hat{\mathcal{P}}_x$, and the accumulated sequential left-product of the $V(k,\ell;c)$ matrices yields the estimated separating matrix. Since the sum of marginal entropies of the elements of y is bounded below and is guaranteed not to increase (usually to decrease) in each sweep, and since the algorithm stops upon encountering the first sweep without such a decrease, such a stop is guaranteed to occur within a finite number of sweeps.

Note, however, that in general there is no guarantee for consistent separation using this algorithm, i.e., even if the true probabilities tensor $\mathcal{P}_x$ of the observations is known (and used), the stopping point is generally not guaranteed to imply separation. The rationale behind this algorithm is the hope that such a "pairwise separation" scheme would ultimately yield pairwise independence, which, at least for P = 2 and P = 3, would in turn imply full independence (hence separation), per Theorem 2 above. Strictly speaking, however, this algorithm is not even guaranteed to yield pairwise separation.
For example, consider the P = 2 case, with a mixing matrix 

111 
110 
10 10 
10 1 

when all the sources have equal p(1) (probability of taking the value 1). In this particular case, the number of 1-s in a linear combination of any two rows is greater than or equal to the number of 1-s in each of the two rows. Therefore, there is no pairwise linear combination which reduces the entropy of any of the mixtures in this case, and the algorithm may stop short of full separation when such a condition is encountered.

Nevertheless, such conditions are relatively rare, and, as we show in simulation results in the following section, this algorithm is quite successful. Its leading advantage over AMERICA is in its reduced computational complexity when the unmixing matrix B is sparse and $K \gg P$.

Indeed, the computational complexity of this iterative algorithm naturally depends on the number of required sweeps and on the number of updates in each sweep, which in turn depend strongly on the true mixing matrix A (and, to some extent, also on the sources' realizations). Testing each couple (k,ℓ) requires computation of the joint probabilities matrix $P_{k,\ell}$, which requires $O(P^K)$ additions (no multiplications are needed). Then, looking for the optimal c requires P − 1 computations of the probabilities vector of the respective $\tilde{y}_k$: a total of additional $O(P^3)$ additions (again, no multiplications are needed for this) and $O(P^2)$ log operations. If an update takes place, recalculation of $\mathcal{P}_y$ is also needed, which is $O(P^K)$ (but, as mentioned above, this is merely a permutation of the tensor).

Therefore, the first sweep requires $O(P^2(P^K + P^3)) = O(P^{K+2} + P^5)$ operations and $O(P^4)$ log operations, plus $O(P^K)$ for each update within the sweep. Naturally, a couple tested in one sweep does not have to be tested in a subsequent sweep if no substitution involving any of its members had occurred in the former. Therefore, for subsequent sweeps the number of operations can be significantly smaller, depending on the number of updates occurring along the way, which is obviously data-dependent. The number of required sweeps is also data-dependent.

Thus, the computational complexity of this algorithm, assuming K > 3, can be roughly estimated at $O(P^K \cdot (N_d P^2))$, where $N_d$ denotes a data-dependent constant, which can be very small (of the order of 2 to 3) when the true demixing matrix B is very sparse (only a few sweeps with few updates are needed), but can be considerably large when B is rather "rich". Compared to the computational complexity of AMERICA, we observe that, assuming $K \gg P$, this algorithm is preferable if $N_d P^2 < K^2$.

The algorithm, which was given the acronym "MEXICO" (Minimizing Entropies by eXchanging In COuples), is summarized in Table 2.
Algorithm 2: MEXICO

Input: $\hat{\mathcal{P}}_x$ - the mixtures' K-way P × P × ··· × P estimated (empirical) probabilities tensor;

Output: $\hat{B}$ - the K × K estimated separation matrix;

Algorithm:

1) Initialize: $\hat{B} := I$. Conceptually, we denote the "demixed" random vector $y = \hat{B} \circ x$, so set $\hat{\mathcal{P}}_y := \hat{\mathcal{P}}_x$;

2) Initialize: $\hat{h} = [\hat{h}_1 \cdots \hat{h}_K]^T$ with the empirical entropies of the K respective elements $y_k$ (each computed from the empirical probabilities vector, which is obtained by summation over all other (≠ k) dimensions in $\hat{\mathcal{P}}_y$);

3) Initialize F as a K × K all-ones flags matrix: F(k,ℓ) = 1 means that the (k,ℓ)-th couple needs to be (re)tested;

4) Run a "sweep": Repeat for k = 1, ..., K, for ℓ = 1, ..., K, ℓ ≠ k:
   If F(k,ℓ) = 1 do the following:

   a) Compute $\hat{P}_{k,\ell}$, the empirical joint probabilities matrix of $y_k$ and $y_\ell$, by summation over all other dimensions (≠ k,ℓ) in $\hat{\mathcal{P}}_y$;

   b) For c = 1, ..., P − 1, compute the elements of $\hat{p}_{\tilde{y}_k}(c)$, the probabilities vector of $\tilde{y}_k = y_k \oplus c \otimes y_\ell$, in a way similar to (24), yielding its entropy $\hat{H}(\tilde{y}_k;c)$;

   c) Denote the minimum entropy as $\hat{H}_0 = \hat{H}(\tilde{y}_k;c_0)$ (with $c_0$ denoting the minimizing c);

   d) If $\hat{H}_0 < \hat{h}_k$, apply a substitution:

      i) Set $V = I + c_0 \cdot E_{k,\ell}$;
      ii) Update $\hat{B} := V \circ \hat{B}$;
      iii) Update the probabilities tensor using (20) with G = V;
      iv) Mark all couples involving k as "need to be retested": F(k,:) := 1, F(:,k) := 1;
      v) Update $\hat{h}_k := \hat{H}_0$;
      vi) (Conceptually: $y := V \circ y$);

   e) Mark the (k,ℓ)-th element as "tested": F(k,ℓ) := 0, and proceed;

5) If F ≠ I (there are still couples to be (re)tested), run another sweep; else stop.
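A compact sketch of the sweeps in Algorithm 2, under stated assumptions: the tensor is stored as a dictionary, the distribution of $y_k \oplus c \otimes y_\ell$ is accumulated directly from the tensor (equivalent to forming the joint matrix first), and a small tolerance `1e-12` is added to guard against updates driven by round-off only. The function names, the tolerance and the example values are hypothetical.

```python
import math
from itertools import product

def entropy(p):
    return -sum(q * math.log(q) for q in p if q > 0)

def marginal(P_y, k, P):
    p = [0.0] * P
    for y, mass in P_y.items():
        p[y[k]] += mass
    return p

def mexico(P_x, P, K):
    B = [[int(r == c) for c in range(K)] for r in range(K)]   # B := I
    P_y = dict(P_x)
    h = [entropy(marginal(P_y, k, P)) for k in range(K)]
    F = [[1] * K for _ in range(K)]                           # retest flags
    while any(F[k][l] for k in range(K) for l in range(K) if k != l):
        for k, l in product(range(K), repeat=2):              # one sweep
            if k == l or not F[k][l]:
                continue
            best_c, best_h = None, h[k]
            for c in range(1, P):
                p = [0.0] * P
                for y, mass in P_y.items():
                    p[(y[k] + c * y[l]) % P] += mass
                hc = entropy(p)
                if hc < best_h - 1e-12:
                    best_c, best_h = c, hc
            if best_c is not None:
                # V = I + c0*E_{k,l}: add c0 times row l of B to row k.
                B[k] = [(a + best_c * b) % P for a, b in zip(B[k], B[l])]
                new = {}
                for y, mass in P_y.items():                   # permute the tensor
                    z = y[:k] + ((y[k] + best_c * y[l]) % P,) + y[k + 1:]
                    new[z] = new.get(z, 0.0) + mass
                P_y, h[k] = new, best_h
                for j in range(K):
                    F[k][j] = F[j][k] = 1
            F[k][l] = 0
    return B

# Toy example over GF(3) with the true tensor of x = A o s.
P, K = 3, 2
p_s = [[0.7, 0.2, 0.1], [0.6, 0.1, 0.3]]
A = [[1, 1], [1, 2]]
P_x = {}
for s in product(range(P), repeat=K):
    x = tuple(sum(a * v for a, v in zip(row, s)) % P for row in A)
    P_x[x] = P_x.get(x, 0.0) + p_s[0][s[0]] * p_s[1][s[1]]
B = mexico(P_x, P, K)
contamination = [[sum(B[r][j] * A[j][c] for j in range(K)) % P for c in range(K)]
                 for r in range(K)]
```

For this particular mixture the sweeps reach a scaled permutation; as discussed above, this is not guaranteed in general.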
VI. Rudimentary performance analysis and simulation results

In this section we present a rudimentary analysis of the expected performance of the proposed 
algorithms, in order to obtain an estimate of the expected rate of success in separating the sources, 
at least in some simple cases. 

Let us first establish the concept of equivariance. In classical ICA, an algorithm is called equivariant (see, e.g., [10]) with respect to the mixing matrix A, if its performance does not depend on A (as long as it is invertible), but only on the realization of the sources. This appealing property is shared by many (but certainly not by all) classical ICA algorithms (in the context of noiseless classical ICA).

We shall now show that, with some slight modification, the AMERICA algorithm is equivariant. Recall that AMERICA is based on computation of all the empirical probabilities vectors $\hat{p}_{y_n}$ of the random variables $y_n = i_n^T \circ x$ for all possible index-combinations $i_n$, followed by sequential extraction of the index-vectors $i_n$ corresponding to the smallest entropies (while maintaining sequential mutual linear independence). Although not directly calculated in this way in the algorithm, the ℓ-th element of $\hat{p}_{y_n}$ is evidently given by

$$\hat{p}_{y_n}(\ell) = \widehat{\Pr}\{i_n^T \circ x = \ell\} = \frac{1}{T}\sum_{t=1}^{T} I\{i_n^T \circ x[t] = \ell\}, \tag{28}$$

where $\widehat{\Pr}\{\cdot\}$ denotes the empirical probability, and where $I\{\cdot\}$ denotes the indicator function. But since $x[t] = A \circ s[t]$, we obviously have

$$I\{i_n^T \circ x[t] = \ell\} = I\{(A^T \circ i_n)^T \circ s[t] = \ell\}, \tag{29}$$

which means that with any given realization s[1], ..., s[T] of the sources, the empirical probabilities vector $\hat{p}_{y_n}$ of $y_n = i_n^T \circ x$ obtained when the mixing matrix is A, is equal to some empirical probabilities vector $\hat{p}_{y_m}$ of $y_m = i_m^T \circ x$ obtained when the mixing matrix is I (i.e., when there is no mixing), such that $i_m = A^T \circ i_n$. Since A is invertible, this relation is bijective, which implies that the $P^K - 1$ empirical probabilities vectors obtained with any (invertible) mixing are merely a permutation of the same set of $P^K - 1$ vectors that would be obtained when the sources are not mixed. Consequently, if, based on the empirical entropies of these empirical vectors, the matrix $\hat{B} = [i_{n_1}\; i_{n_2} \cdots i_{n_K}]^T$ is formed by the algorithm when the mixing matrix is A, this implies that the matrix

$$\hat{B}_0 = [i_{m_1}\; i_{m_2} \cdots i_{m_K}]^T = (A^T \circ [i_{n_1}\; i_{n_2} \cdots i_{n_K}])^T = \hat{B} \circ A \tag{30}$$

would be formed by the algorithm when the sources are unmixed. Consequently, the overall mixing-unmixing matrix¹ $\hat{B} \circ A$ in the mixed case would equal the overall mixing-unmixing matrix $\hat{B}_0 \circ I = \hat{B} \circ A$ in the unmixed case. This means that, no matter what the (invertible) mixing matrix is, the overall mixing-unmixing matrix would be the same as would be obtained by the AMERICA algorithm in the unmixed case, implying the desired equivariance property.

¹ This matrix is sometimes also called the "contamination" matrix, describing the residual mixing (if any).
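The permutation argument in (28)-(30) can be checked numerically on a small example. The sketch below draws one hypothetical realization of K = 2 sources over GF(3), mixes it, and compares the empirical histogram of each combination $i_n^T \circ x$ with that of $(A^T \circ i_n)^T \circ s$; the seed, weights and matrix are arbitrary illustrative choices.

```python
import random
from itertools import product

P, K, T = 3, 2, 50
rng = random.Random(0)
A = [[1, 1], [1, 2]]                     # invertible over GF(3)
s = [tuple(rng.choices(range(P), weights=[6, 3, 1])[0] for _ in range(K))
     for _ in range(T)]
x = [tuple(sum(a * v for a, v in zip(row, st)) % P for row in A) for st in s]

def counts(data, i_n):
    """Histogram (T times the empirical probabilities) of i_n^T o v over the data."""
    c = [0] * P
    for v in data:
        c[sum(a * b for a, b in zip(i_n, v)) % P] += 1
    return tuple(c)

def A_transpose_times(i_n):
    """The index vector i_m = A^T o i_n."""
    return tuple(sum(A[r][c] * i_n[r] for r in range(K)) % P for c in range(K))

index_vectors = [i for i in product(range(P), repeat=K) if any(i)]
matches = all(counts(x, i) == counts(s, A_transpose_times(i)) for i in index_vectors)
same_set = (sorted(counts(x, i) for i in index_vectors)
            == sorted(counts(s, i) for i in index_vectors))
```

Both `matches` and `same_set` hold for any realization, since the identity (29) is exact sample by sample.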

There is, however, one small caveat that has to be considered. The reasoning above assumes that 
the sequential progress of the algorithm through the sorted empirical entropies for selecting, testing 
(for linear dependence) and using the index-vectors is uniquely determined by the calculated entropy 
values, and is independent of the values of the index-vectors. This is generally true, with one possible 
exception: If the set of empirical entropies happens to contain a subset with equal entropies, the 
(arbitrary) order in which the index-vectors within such a subset are sorted is usually lexicographic 
- which introduces dependence on the actual index values, and such dependence is not permutation- 
invariant - thereby potentially introducing dependence on the mixing matrix in turn. In order to avoid 
this condition, any subset with equal empirical entropies should be internally sorted in a way which is independent of the corresponding index-vector values, e.g., by randomization. Note that
the occurrence of such a subset (with empirical entropies that are exactly equal) becomes very rare 
when the number of observations T is large, but may certainly happen when T is relatively small. 
Note further, that with such randomization the attained separation for a given realization depends 
not only on the sources' realization, but also on this random sorting within subsets (but not on the 
mixing matrix), and therefore only statistical measures of the performance (e.g., the probability of 
perfect separation) can be considered equivariant. 

Having established the equivariance, we now proceed to analyze the probability of perfect separation in the simplest case: P = 2, K = 2. Thanks to the equivariance property we may assume, without loss of generality, that the mixing matrix is the identity matrix, A = I. Let $p_{s_1}(1) = p_1$ (resp., $p_{s_2}(1) = p_2$) denote the probability with which the first (resp., second) source takes the value 1. Due to the assumed non-mixing conditions (A = I), these are also the probabilities of the "mixtures" $x_1$ and $x_2$. To characterize the empirical probabilities tensor $\hat{\mathcal{P}}_x$, let us denote by $N_{00}$, $N_{01}$, $N_{10}$ and $N_{11}$ the number of occurrences of $x[t] = [0\;0]^T$, $x[t] = [0\;1]^T$, $x[t] = [1\;0]^T$ and $x[t] = [1\;1]^T$ (resp.) within the observed sequence of length T. Thus, the elements of the 2 × 2 empirical probabilities tensor (a matrix in this case) are $\hat{\mathcal{P}}_x(m_1,m_2) = N_{m_1 m_2}/T$, for $m_1, m_2 \in \{0,1\}$.

The empirical probability $\hat{p}_{x_1}(1)$ of $x_1$ taking the value 1 is given by $\hat{\mathcal{P}}_x(1,0) + \hat{\mathcal{P}}_x(1,1) = (N_{10} + N_{11})/T$. The empirical probability $\hat{p}_{x_1 \oplus x_2}(1)$ of the random variable $x_1 \oplus x_2$ taking the value 1 is given by $\hat{\mathcal{P}}_x(1,0) + \hat{\mathcal{P}}_x(0,1) = (N_{10} + N_{01})/T$. An identification error would occur if the entropy associated with the latter is smaller than that associated with the former (because then the (wrong) linear combination vector $i_3^T = [1\;1]$ would be preferred by the algorithm over the (correct) linear combination vector $i_1^T = [1\;0]$ as a row in $\hat{B}$).

In the P = 2 case, the entropy is monotonically decreasing in the distance of p(1) (or p(0)) from $\frac{1}{2}$. Assuming that T is "sufficiently large", the empirical $\hat{p}_{x_1}(1)$ would be close to its true value $p_1$, and the empirical $\hat{p}_{x_1 \oplus x_2}(1)$ would be close to its true value $p_1(1-p_2) + p_2(1-p_1) = p_1 + p_2 - 2p_1p_2$. Assuming that $p_1, p_2 < \frac{1}{2}$, both $p_1$ and $p_1 + p_2 - 2p_1p_2$ are smaller than $\frac{1}{2}$, and we can therefore assume that so are the empirical $\hat{p}_{x_1}(1)$ and $\hat{p}_{x_1 \oplus x_2}(1)$. Thus, the empirical entropy associated with the linear combination $x_1 \oplus x_2$ would be smaller than that associated with $x_1$ if

$$\hat{p}_{x_1 \oplus x_2}(1) < \hat{p}_{x_1}(1) \;\Leftrightarrow\; \tfrac{1}{T}(N_{10} + N_{01}) < \tfrac{1}{T}(N_{10} + N_{11}) \;\Leftrightarrow\; N_{01} < N_{11}. \tag{31}$$


We are therefore interested in the probability of the event $\Xi_1: N_{01} < N_{11}$. Let us denote by $N_2 = N_{01} + N_{11}$ the number of occurrences of $x_2[t] = 1$ in [1,T]. The probability of $\Xi_1$ can then be expressed as follows:

$$\Pr\{\Xi_1\} = \Pr\{N_{01} < N_{11}\} = \Pr\{N_{11} > \tfrac{1}{2}N_2\} = \sum_{M=1}^{T} \Pr\{N_2 = M \cap N_{11} > \tfrac{1}{2}M\} = \sum_{M=1}^{T} \Pr\{N_2 = M\}\Pr\{N_{11} > \tfrac{1}{2}M \mid N_2 = M\}. \tag{32}$$

Due to the statistical independence between the sources (and therefore between $x_1$ and $x_2$), given that $N_2 = M$, the random variable $N_{11}$ is simply the number of occurrences of $x_1[t] = 1$ among M independent trials: a Binomial random variable with M trials and probability $p_1$, which we shall denote as $N_{1,M} \sim B(M,p_1)$. Thus,

$$\Pr\{\Xi_1\} = \Pr\{N_{01} < N_{11}\} = \sum_{M=1}^{T} \Pr\{N_2 = M\}\Pr\{N_{1,M} > \tfrac{1}{2}M\} = \sum_{M=1}^{T} \binom{T}{M} p_2^M (1-p_2)^{T-M} \cdot \sum_{N=\lfloor M/2 \rfloor + 1}^{M} \binom{M}{N} p_1^N (1-p_1)^{M-N}. \tag{33}$$

The inner sum is the complementary cumulative distribution function of the binomial distribution, which can also be expressed using the normalized incomplete beta function²,

$$\Pr\{N_{1,M} > \tfrac{1}{2}M\} = 1 - \Pr\{N_{1,M} \le \tfrac{1}{2}M\} = 1 - I_{1-p_1}(M - \lfloor \tfrac{1}{2}M \rfloor, \lfloor \tfrac{1}{2}M \rfloor + 1) = I_{p_1}(\lfloor \tfrac{1}{2}M \rfloor + 1, \lceil \tfrac{1}{2}M \rceil), \tag{34}$$

with

$$I_p(n,m) = n\binom{n+m-1}{m-1}\int_0^p t^{n-1}(1-t)^{m-1}\,dt = 1 - I_{1-p}(m,n). \tag{35}$$


Note further (from (33)) that the probability of $\Xi_1$ can be expressed as

$$\Pr\{\Xi_1\} = E\left[I_{p_1}(\lfloor \tfrac{1}{2}N_2 \rfloor + 1, \lceil \tfrac{1}{2}N_2 \rceil)\right] \tag{36}$$

(where the expectation is taken with respect to $N_2$). When $p_2 \cdot T$ is "sufficiently large" this probability may be approximated by substituting $N_2$ with its mean, $E[N_2] = p_2 \cdot T$,

$$\Pr\{\Xi_1\} \approx I_{p_1}(\lfloor \tfrac{p_2 T}{2} \rfloor + 1, \lceil \tfrac{p_2 T}{2} \rceil). \tag{37}$$

² See, e.g., Binomial Distribution from Wikipedia [online], available: http://en.wikipedia.org/wiki/Binomial_distribution

The event $\Xi_1$ is just one possible component of an error event in which the algorithm would prefer the (wrong) linear combination vector $i_3^T = [1\;1]$ over the (correct) linear combination vector $i_1^T = [1\;0]$. Such an error may also happen when the empirical entropies of $x_1$ and of $x_1 \oplus x_2$ are equal, namely when $N_{01} = N_{11}$; assuming that the algorithm makes a random decision in such cases (to ensure mean equivariance, as discussed above), the probability of an error being caused by this event (denoted $\Xi_2$) would be $\frac{1}{2}\Pr\{\Xi_2\}$. Evidently,

$$\Pr\{\Xi_2\} = \Pr\{N_{01} = N_{11}\} = \sum_{M=0}^{T} \Pr\{N_2 = M\}\Pr\{N_{1,M} = \tfrac{1}{2}M\} = \sum_{M'=0}^{\lfloor T/2 \rfloor} \binom{T}{2M'} p_2^{2M'}(1-p_2)^{T-2M'} \cdot \binom{2M'}{M'} p_1^{M'}(1-p_1)^{M'} = \sum_{M=0}^{\lfloor T/2 \rfloor} \frac{T!}{(T-2M)!\,(M!)^2}\left(\frac{p_1(1-p_1)\,p_2^2}{(1-p_2)^2}\right)^{M}(1-p_2)^T. \tag{38}$$

Note that since the event $N_{1,M} = \frac{1}{2}M$ can only happen for even values of M, an approximation using the mean with respect to $N_2$ (as used for $\Pr\{\Xi_1\}$ above) would be far less accurate, and is therefore not pursued.

Summarizing this part of the error analysis, the probability that the algorithm would wrongly prefer $i_3^T = [1\;1]$ over $i_1^T = [1\;0]$ as a row in $\hat{B}$ can be approximated as

$$\Pr\{\Xi_1\} + \tfrac{1}{2}\Pr\{\Xi_2\} \approx I_{p_1}(\lfloor \tfrac{p_2 T}{2} \rfloor + 1, \lceil \tfrac{p_2 T}{2} \rceil) + \frac{1}{2}\sum_{M=0}^{\lfloor T/2 \rfloor} \frac{T!}{(T-2M)!\,(M!)^2}\left(\frac{p_1(1-p_1)\,p_2^2}{(1-p_2)^2}\right)^{M}(1-p_2)^T. \tag{39}$$

An error in the "opposite" direction occurs when the algorithm prefers $i_3^T = [1\;1]$ over $i_2^T = [0\;1]$ as a row in $\hat{B}$. The probability of this kind of error is evidently given by the same expressions, with the roles of $p_1$ and $p_2$ swapped. A failure of the algorithm is defined as the occurrence of either one of the two errors. Although they are certainly not mutually exclusive, we can still approximate (or at least provide an approximate upper-bound for) the probability of occurrence of either one, by the sum of probabilities of occurrence of each. Assuming, for further simplicity of the exposition, that $p_1 = p_2 = p$, the approximate probability of failure is given by

$$\Pr\{\text{Failure}\} \approx 2 \cdot I_p(\lfloor \tfrac{pT}{2} \rfloor + 1, \lceil \tfrac{pT}{2} \rceil) + \sum_{M=0}^{\lfloor T/2 \rfloor} \frac{T!}{(T-2M)!\,(M!)^2}\left(\frac{p^3}{1-p}\right)^{M}(1-p)^T. \tag{40}$$

Recall that two assumptions are necessary for this approximation to hold: i) that p is sufficiently 
smaller than 0.5; and ii) that p • T is sufficiently large. 
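For reference, the sums in (33) and (38) can also be evaluated numerically without resorting to the incomplete beta function. The sketch below is an illustration with hypothetical function names; note that it evaluates $\Pr\{\Xi_1\}$ exactly through the double sum (33) rather than through the mean-substitution approximation (37), so its output is the union-bound estimate $2\Pr\{\Xi_1\} + \Pr\{\Xi_2\}$ and need not coincide numerically with (40).

```python
from math import comb

def prob_xi1(T, p1, p2):
    """Pr{N01 < N11}, evaluated through the double sum in (33)."""
    total = 0.0
    for M in range(1, T + 1):
        pr_n2 = comb(T, M) * p2**M * (1 - p2)**(T - M)      # Pr{N2 = M}
        tail = sum(comb(M, N) * p1**N * (1 - p1)**(M - N)   # Pr{N_{1,M} > M/2}
                   for N in range(M // 2 + 1, M + 1))
        total += pr_n2 * tail
    return total

def prob_xi2(T, p1, p2):
    """Pr{N01 = N11}, following (38); only even values N2 = 2M contribute."""
    return sum(comb(T, 2 * M) * p2**(2 * M) * (1 - p2)**(T - 2 * M)
               * comb(2 * M, M) * (p1 * (1 - p1))**M
               for M in range(T // 2 + 1))

def failure_estimate(T, p):
    """Sum of the two error probabilities for p1 = p2 = p (an approximate upper bound)."""
    return 2 * prob_xi1(T, p, p) + prob_xi2(T, p, p)

estimate = failure_estimate(T=100, p=0.35)
```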

In order to test this approximation we simulated the mixing and separation of K = 2 independent binary (P = 2) sources, each taking the value 1 with probability p. In Fig. 1 we compare the analytic prediction (40) to the empirical probability of failure obtained in 25,000 independent experiments (the sources and the mixing matrix were drawn independently in each trial) vs. p for T = 100. Failure of the separation is defined as the case in which $\hat{B} \circ A$ is not a permutation matrix. We used the AMERICA algorithm for separation (but for this (K = 2) case, similar results are obtained with MEXICO). The circles show the empirical probabilities, whereas the solid line shows the approximate analytic prediction (40). The good match is evident.

Fig. 1. Empirical probability of failure ('o') and its analytic approximation (solid) vs. the probability p for P = 2, K = 2 sources, T = 100. The empirical probabilities were obtained using 25,000 independent trials.

When K is larger than 2, an approximate error expression can be obtained by assuming that this type of error can occur independently for each of the K(K − 1)/2 different couples. Under this approximate independence assumption, we get

$$\Pr\{\text{Failure}; K\} \approx 1 - (1 - \Pr\{\text{Failure}; K = 2\})^{K(K-1)/2}, \tag{41}$$

where $\Pr\{\text{Failure}; K = 2\}$ is given in (40) above. We assume here, for simplicity of the exposition, that all of the sources take the value 1 with similar probability p. Extension to the case of different probabilities can be readily obtained by using (39) for each (ordered) couple.

To illustrate, we compare this expression in Fig. 2 to the empirical probability of failure (obtained in 100,000 independent experiments) vs. T for p = 0.35 with K = 2, 3, 4, 5, 6. Again, failure of the separation is defined as the case in which $\hat{B} \circ A$ is not a permutation matrix (namely, any result which does not provide perfect separation of all of the K sources is considered a "failure"). A good match is evident for the smaller values of K, with some departure for the higher values, as could be expected from the approximation induced by the error-independence assumption.

Fig. 2. Empirical probability of failure ('o') and its analytic approximation (solid) vs. the observation length T for P = 2, K = 2, 3, 4, 5 and 6 sources with p = 0.35, using the AMERICA algorithm. The empirical probabilities were obtained using 100,000 independent trials.

Next, we compare the empirical, average running times of the two separation algorithms under asymptotic conditions. The "asymptotic" conditions are emulated by substituting the estimated (empirical) probabilities tensor $\hat{\mathcal{P}}_x$ with the true probabilities tensor $\mathcal{P}_x$ as the input to the algorithms. We simulated two cases: a "full" mixing matrix and a "sparse" mixing matrix. The "full" K × K (non-singular) mixing matrices were randomly drawn in each trial as a product of a lower triangular and an upper triangular matrix. The lower triangular matrix L was generated with random values independently and uniformly distributed in GF(P) on and below the main diagonal, substituting any 0-s along the main diagonal with 1-s; the upper triangular matrix U was similarly generated by drawing all values above the main diagonal, and setting the main diagonal to all-1-s. Then $A = U \circ L$. For generating the "sparse" matrices, the off-diagonal values of L and U were "sparsified" by randomly (and independently) zeroing-out each element, with probability 0.9.

Fig. 3. Average running times (in [seconds]) for the AMERICA (dashed) and MEXICO (solid) algorithms, for full ('*') and sparse ('o') matrices. Note that the AMERICA plots for both the full and the sparse mixing case are nearly identical.
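The matrix-generation procedure described above might be sketched as follows. The function name, the `zero_prob` parameter and the use of Python's `random` module are assumptions made for the illustration (the original experiments used Matlab code).

```python
import random

def random_mixing_matrix(K, P, rng, zero_prob=0.0):
    """Draw a nonsingular K x K matrix A = U o L over GF(P).

    zero_prob = 0.0 gives a "full" matrix; zero_prob = 0.9 mimics the "sparse" case,
    where each off-diagonal entry of L and U is independently zeroed out.
    """
    def off_diagonal():
        return 0 if rng.random() < zero_prob else rng.randrange(P)

    L = [[0] * K for _ in range(K)]
    U = [[0] * K for _ in range(K)]
    for r in range(K):
        for c in range(K):
            if c < r:
                L[r][c] = off_diagonal()       # strictly below the diagonal
            elif c > r:
                U[r][c] = off_diagonal()       # strictly above the diagonal
        L[r][r] = rng.randrange(P) or 1        # a drawn 0 on the diagonal becomes 1
        U[r][r] = 1
    return [[sum(U[r][j] * L[j][c] for j in range(K)) % P for c in range(K)]
            for r in range(K)]

rng = random.Random(1)
A_full = random_mixing_matrix(K=3, P=3, rng=rng)
A_sparse = random_mixing_matrix(K=3, P=3, rng=rng, zero_prob=0.9)
```

Both triangular factors have nonzero diagonals, so their product is always invertible over GF(P).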

The elements of each of the sources' probabilities vectors $p_{s_1}, \ldots, p_{s_K}$ were drawn uniformly in (0, 1) and then normalized by their sum. The average running times (using Matlab® code [11] for both algorithms on a PC Pentium® 4 running at 3.4GHz) for several combinations of P and K are shown in Fig. 3. Both algorithms were applied to the same data, and the running times were averaged over 4000 independent trials. As expected, the AMERICA algorithm is seen to be insensitive to the structure (full / sparse) of the mixing matrix; however, the MEXICO algorithm runs considerably faster when A is sparse. Therefore, in terms of running speed, MEXICO may be preferable when the mixing matrix is known to be sparse, especially for relatively high values of K.

Note, however, that this advantage is somewhat overshadowed by a degradation in the resulting separation performance. While perfect separation was obtained (thanks to the "asymptotic" conditions) in all of the timing experiments by the AMERICA algorithm, a few cases of imperfect separation by MEXICO were encountered, especially for the highest values of K with the "full" mixtures.
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Fig. 4. Empirical mean number of unseparated sources (out of the K = 5 sources for AMERICA (dashed) 
and MEXICO (solid) algorithms, for full ('*') and sparse ('o') matrices for P — 3,5, 7. Each point reflect the 
average of 40, 000 trials. Note that the AMERICA plots for both the full and the sparse mixing case are nearly 
identical. 



To conclude this section, we provide (in Fig. 4) some empirical results showing the performance 
for P = 3, 5, 7 with K = 5 sources, with random sources' probabilities vectors. The randomized 
elements of the probability vectors were independently drawn (for each source, at each trial) from a 
uniform distribution, and then normalized such that the elements of each probability vector 
add up to 1. The mixing matrix was randomized at each trial as described above, once for a "full 
A" and once for a "sparse A" version. In this experiment the performance is measured as the mean 
number of unseparated sources, which is defined (per trial) as the number of rows in the resulting 
"contamination matrix" B ∘ A containing more than one non-zero element (since, by construction 
of B in both MEXICO and AMERICA, B ∘ A is always nonsingular, this is exactly the number of 
sources which remain unseparated by the algorithm). Each result on the plot reflects the average of 
40,000 trials. 
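
The per-trial performance measure described above is easy to state in code. The following is a minimal sketch (the function name is hypothetical) that forms the contamination matrix over GF(P) and counts its rows with more than one non-zero element.

```python
def unseparated_sources(B, A, P):
    """Number of rows of the contamination matrix B*A (mod P) with more than one non-zero entry."""
    K = len(A)
    contamination = [[sum(B[i][m] * A[m][j] for m in range(K)) % P
                      for j in range(K)] for i in range(K)]
    return sum(1 for row in contamination if sum(1 for x in row if x) > 1)
```

A separating matrix that yields a scaled permutation as the contamination matrix gives a count of zero; any remaining mixing shows up as a positive count.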

The AMERICA algorithm appears significantly more successful than the MEXICO algorithm, 
especially with the higher values of P (interestingly, the performance of AMERICA seems to 
improve with the increase in P, whereas the performance of MEXICO exhibits an opposite trend). 
The advantage of MEXICO is confined to cases of small P and large K, where its potentially reduced 
computational load does not come at the expense of a severe degradation in performance. 

VII. CONCLUSION 

We provided a study of general properties, identifiability conditions and separation algorithms for 
ICA over Galois fields of prime order P. We have shown that a linear mixture of independent sources 
is identifiable (up to permutation and, for P > 2, up to scale) if and only if none of the sources is 
uniform. We have shown that pairwise independence of an invertible linear mixture of the sources 
implies their full independence (namely, implies that the mixture is a scaled permutation) for P = 2 
and for P = 3, but not necessarily for P > 3. 
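
One way to see why a uniform source is problematic is the following numerical sketch (hypothetical function names, plain Python). If u is uniform over GF(P) and independent of v, then u + v is again uniform and, moreover, independent of v, so the pair (u + v, v) is statistically indistinguishable from an unmixed pair of independent sources.

```python
def pmf_of_sum(p, q):
    """PMF of (u + v) mod P for independent u ~ p and v ~ q (cyclic convolution)."""
    P = len(p)
    return [sum(p[m] * q[(n - m) % P] for m in range(P)) for n in range(P)]

def joint_of_sum_and_second(p, q):
    """Joint PMF J[a][b] = Pr{(u + v) mod P = a, v = b} for independent u ~ p, v ~ q."""
    P = len(p)
    return [[p[(a - b) % P] * q[b] for b in range(P)] for a in range(P)]
```

With `p` uniform, every entry `J[a][b]` equals the product of the marginals `pmf_of_sum(p, q)[a] * q[b]`, which is exactly the independence of the mixture u + v from v.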

We proposed two different iterative separation algorithms: The first algorithm, given the acronym 
AMERICA, is based on sequential identification of the smallest-entropy linear combinations of the 
mixtures. The second, given the acronym MEXICO, is based on sequential reduction of the pairwise 
mutual information measures. We provided a rudimentary performance analysis for P = 2, which 
applies to both algorithms with K = 2, demonstrating a good fit of the empirical results to the 
theoretical prediction. For higher values of K (still with P = 2), we demonstrated a reasonable fit 
up to K ≈ 6 for the AMERICA algorithm. 

AMERICA is guaranteed to provide consistent separation (i.e., to recover all sources when the 
observation length T is infinite), and generally exhibits better performance (success rate) than 
MEXICO with finite data lengths. However, when the mixing matrix is known to be sparse, MEXICO can 
have some advantage over AMERICA in its relative computational efficiency, especially for larger 
values of K. Matlab® code for both algorithms is available online [11]. 

Extensions of our results to common variants of the classical ICA problem are all possible. 
These include ICA in the presence of additive noise, the under-determined case (more sources than 
mixtures), and alternative sources of diversity (e.g., different temporal structures) of the sources. 
For example, just like in classical ICA, temporal or spectral diversity would make it possible to relax the 
identifiability condition, so as to accommodate sources with uniform (marginal) distributions, which 
might be more commonly encountered. However, these extensions fall beyond the scope of our current 
work, whose main goal is to set the basis for migrating ICA from the real- (or complex-) valued 
algebraic fields to other fields. 
Appendix A - A proof of Theorem 2 

In this Appendix we provide a proof of Theorem 2 for both GF(2) and GF(3). Let $s$ be a $K \times 1$ 
random vector whose elements are statistically independent, non-degenerate and non-uniform random 
variables in either GF(2) or GF(3), and let $y = D \circ s$ denote a $K \times 1$ vector of non-trivial linear 
combinations of the elements of $s$ over the field, prescribed by the elements of the $K \times K$ matrix 
$D$ (in either GF(2) or GF(3), resp.). 

Assume that $D$ is a general matrix, and consider any pair $y_k$ and $y_\ell$ ($k \neq \ell$) in $y$. Both $y_k$ and $y_\ell$ are 
linear combinations of respective groups of the sources, indexed by the non-zero elements in $D_{k,:}$ 
and $D_{\ell,:}$, the $k$-th and $\ell$-th rows (resp.) of $D$. 

Let us consider the case of GF(2) first. 



A. The GF(2) case 

The two groups composing $y_k$ and $y_\ell$ define, in turn, three other subgroups (some of which may 
be empty): 

1) Sub-group 1: Sources common to $D_{k,:}$ and $D_{\ell,:}$. Denote the sum of these sources as $u$; 

2) Sub-group 2: Sources included in $D_{k,:}$ but excluded from $D_{\ell,:}$. Denote the sum of these sources 
as $v_1$; 

3) Sub-group 3: Sources included in $D_{\ell,:}$ but excluded from $D_{k,:}$. Denote the sum of these sources 
as $v_2$. 



For example, if (for $K = 6$) $D_{k,:} = [\,0\ 1\ 1\ 1\ 1\ 1\,]$ and $D_{\ell,:} = [\,1\ 1\ 0\ 0\ 1\ 1\,]$, then 
$u = s_2 \oplus s_5 \oplus s_6$, $v_1 = s_3 \oplus s_4$ and $v_2 = s_1$. 
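
The partition into the three sub-groups can be sketched in Python. The function name is hypothetical, and the two rows used in the usage line are an assumed reading of the example, chosen so that the groups come out as u = s2 ⊕ s5 ⊕ s6, v1 = s3 ⊕ s4 and v2 = s1.

```python
def partition_gf2(row_k, row_l):
    """Split 1-based source indices into the common, k-only and l-only sub-groups."""
    pairs = list(enumerate(zip(row_k, row_l), start=1))
    common = [i for i, (a, b) in pairs if a and b]       # sub-group 1 (sum is u)
    only_k = [i for i, (a, b) in pairs if a and not b]   # sub-group 2 (sum is v1)
    only_l = [i for i, (a, b) in pairs if b and not a]   # sub-group 3 (sum is v2)
    return common, only_k, only_l

print(partition_gf2([0, 1, 1, 1, 1, 1], [1, 1, 0, 0, 1, 1]))
# -> ([2, 5, 6], [3, 4], [1])
```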

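Note that by construction (and by independence of the elements of $s$), the random variables $u$, $v_1$ 
and $v_2$ are statistically independent. Their respective probabilities vectors and characteristic vectors 
are denoted 

$$\mathbf{p}_\nu = \begin{bmatrix} 1 - p_\nu(1) \\ p_\nu(1) \end{bmatrix}, \qquad \tilde{\mathbf{p}}_\nu = \begin{bmatrix} 1 \\ \theta_\nu \end{bmatrix}, \quad \text{with } \theta_\nu = 1 - 2p_\nu(1), \text{ for } \nu = u, v_1, v_2. \tag{42}$$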



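Obviously, $y_k = u \oplus v_1$ and $y_\ell = u \oplus v_2$, so their characteristic vectors are given by 

$$\tilde{\mathbf{p}}_{y_k} = \tilde{\mathbf{p}}_u \odot \tilde{\mathbf{p}}_{v_1} = \begin{bmatrix} 1 \\ \theta_u \theta_{v_1} \end{bmatrix}, \qquad \tilde{\mathbf{p}}_{y_\ell} = \tilde{\mathbf{p}}_u \odot \tilde{\mathbf{p}}_{v_2} = \begin{bmatrix} 1 \\ \theta_u \theta_{v_2} \end{bmatrix}, \tag{43}$$

where $\odot$ denotes the Hadamard (element-wise) product. 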
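
The product rule behind (43) can be checked numerically. The sketch below (hypothetical helper names) uses only the definition θ = E[(−1)^x] = 1 − 2 Pr{x = 1} and the distribution of the XOR of two independent binary variables.

```python
def theta(p1):
    """theta = E[(-1)^x] = 1 - 2*Pr{x = 1} for a binary random variable x."""
    return 1.0 - 2.0 * p1

def xor_p1(pa, pb):
    """Pr{a XOR b = 1} for independent binary a, b with Pr{a = 1} = pa, Pr{b = 1} = pb."""
    return pa * (1.0 - pb) + (1.0 - pa) * pb
```

For any `pa`, `pb` in [0, 1], `theta(xor_p1(pa, pb))` agrees with `theta(pa) * theta(pb)` up to rounding, which is the second element of each characteristic vector in (43).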

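Define the random vector $\mathbf{w} = [y_k \ y_\ell]^T$, which can be expressed as the sum of three independent 
random vectors: 

$$\mathbf{w} = \begin{bmatrix} y_k \\ y_\ell \end{bmatrix} = \begin{bmatrix} v_1 \\ 0 \end{bmatrix} \oplus \begin{bmatrix} 0 \\ v_2 \end{bmatrix} \oplus \begin{bmatrix} u \\ u \end{bmatrix} = \mathbf{v}_1 \oplus \mathbf{v}_2 \oplus \mathbf{u}. \tag{44}$$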
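The probabilities matrices of the vectors $\mathbf{v}_1$, $\mathbf{v}_2$ and $\mathbf{u}$ are evidently given by 

$$\mathbf{P}_{\mathbf{v}_1} = \begin{bmatrix} 1 - p_{v_1}(1) & 0 \\ p_{v_1}(1) & 0 \end{bmatrix}, \quad \mathbf{P}_{\mathbf{v}_2} = \begin{bmatrix} 1 - p_{v_2}(1) & p_{v_2}(1) \\ 0 & 0 \end{bmatrix}, \quad \mathbf{P}_{\mathbf{u}} = \begin{bmatrix} 1 - p_u(1) & 0 \\ 0 & p_u(1) \end{bmatrix}, \tag{45}$$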



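and therefore their characteristic matrices are given by 

$$\tilde{\mathbf{P}}_{\mathbf{v}_1} = \begin{bmatrix} 1 & 1 \\ \theta_{v_1} & \theta_{v_1} \end{bmatrix}, \quad \tilde{\mathbf{P}}_{\mathbf{v}_2} = \begin{bmatrix} 1 & \theta_{v_2} \\ 1 & \theta_{v_2} \end{bmatrix}, \quad \tilde{\mathbf{P}}_{\mathbf{u}} = \begin{bmatrix} 1 & \theta_u \\ \theta_u & 1 \end{bmatrix}, \tag{46}$$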



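where (see Section III) $\theta_\nu = E[W^\nu] = E[(-1)^\nu] = 1 - 2p_\nu(1)$, for $\nu = v_1, v_2, u$. Since $\mathbf{v}_1$, $\mathbf{v}_2$ and 
$\mathbf{u}$ are statistically independent, the characteristic matrix of $\mathbf{w}$ is given by 

$$\tilde{\mathbf{P}}_{\mathbf{w}} = \tilde{\mathbf{P}}_{\mathbf{v}_1} \odot \tilde{\mathbf{P}}_{\mathbf{v}_2} \odot \tilde{\mathbf{P}}_{\mathbf{u}} = \begin{bmatrix} 1 & \theta_u \theta_{v_2} \\ \theta_u \theta_{v_1} & \theta_{v_1} \theta_{v_2} \end{bmatrix}. \tag{47}$$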



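On the other hand, if $y_k$ and $y_\ell$ are statistically independent, the characteristic matrix of $\mathbf{w}$ is also 
given by 

$$\tilde{\mathbf{P}}_{\mathbf{w}} = \tilde{\mathbf{p}}_{y_k} \tilde{\mathbf{p}}_{y_\ell}^T = \begin{bmatrix} 1 & \theta_u \theta_{v_2} \\ \theta_u \theta_{v_1} & \theta_u^2 \theta_{v_1} \theta_{v_2} \end{bmatrix}. \tag{48}$$

Equating the expressions in (47) and (48), we get (only the (2, 2) element can differ) 

$$\theta_{v_1} \theta_{v_2} = \theta_u^2 \, \theta_{v_1} \theta_{v_2}. \tag{49}$$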



Since, due to Lemma 1, if neither of the sources is uniform, neither are $v_1$ and $v_2$, we have $\theta_{v_1}, \theta_{v_2} \neq 0$, 
and therefore $\theta_u$ must be either 1 or −1. Since neither of the sources is degenerate, this can only 
happen if $u = 0$ (deterministically), which can only happen if sub-group 1 is empty, namely, if the 
two rows $D_{k,:}$ and $D_{\ell,:}$ do not share common sources, or, in other words, if there is no column $m$ 
in $D$ such that both $D_{k,m}$ and $D_{\ell,m}$ are 1. 
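
The argument can be illustrated numerically by brute-force enumeration. The following sketch (hypothetical function name) computes the characteristic matrix of the pair (u ⊕ v1, u ⊕ v2) directly from the definition E[(−1)^(m·y_k + n·y_ℓ)], so that its (2, 2) element can be compared with the product of the (2, 1) and (1, 2) elements, which is what independence would require.

```python
from itertools import product

def char_matrix_w(pu, pv1, pv2):
    """Characteristic matrix of w = [u^v1, u^v2] for independent binary u, v1, v2.

    Arguments are Pr{. = 1}; entry (m, n) is E[(-1)^(m*y_k + n*y_l)].
    """
    prob = lambda p1, x: p1 if x else 1.0 - p1
    M = [[0.0, 0.0], [0.0, 0.0]]
    for u, v1, v2 in product((0, 1), repeat=3):
        weight = prob(pu, u) * prob(pv1, v1) * prob(pv2, v2)
        yk, yl = u ^ v1, u ^ v2
        for m, n in product((0, 1), repeat=2):
            M[m][n] += weight * (-1) ** (m * yk + n * yl)
    return M
```

With a non-degenerate, non-uniform u (e.g. `pu = 0.3`), the entry `M[1][1]` differs from `M[1][0] * M[0][1]`; with `pu = 0` (an empty common sub-group) the two coincide.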

Applying this to all possible pairs of $k \neq \ell$ (for which $y_k$ and $y_\ell$ are independent), and recalling 
that $D$ cannot have any all-zeros row (no trivial combinations in $y$), we immediately arrive at the 
conclusion that each row and each column of $D$ must contain exactly one 1, meaning that $D$ is a 
permutation matrix. 
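
The final counting step can be confirmed by exhaustive search for a small K. The sketch below (hypothetical names, K = 3 chosen only to keep the enumeration tiny) collects all binary K × K matrices with no all-zeros row and no column shared by two rows, and compares them with the set of permutation matrices.

```python
from itertools import permutations, product

def rows_pairwise_disjoint(D):
    """True if no column holds a 1 in two different rows of D."""
    return all(sum(col) <= 1 for col in zip(*D))

K = 3
rows = [r for r in product((0, 1), repeat=K) if any(r)]   # exclude the all-zeros row
survivors = {D for D in product(rows, repeat=K) if rows_pairwise_disjoint(D)}
perms = {tuple(tuple(int(j == s[i]) for j in range(K)) for i in range(K))
         for s in permutations(range(K))}
```

The two sets `survivors` and `perms` coincide, in agreement with the pigeonhole argument above: K non-empty rows with disjoint supports among K columns leave exactly one 1 per row and per column.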

We now turn to the case of GF(3). 

B. The GF(3) case 

For simplicity of the exposition, we shall now assume that the values taken in GF(3) are {0, 1, −1} 
(rather than {0, 1, 2}). Just like in the GF(2) case, we partition the two groups composing $y_k$ and $y_\ell$ 
into subgroups, but now the first ("common") subgroup is further partitioned into four sub-subgroups: 

1) Sub-group 1: Sources common to $D_{k,:}$ and $D_{\ell,:}$. We partition this sub-group into four 
sub-subgroups according to the coefficients in the respective rows of $D$ as follows: 

• Sub-subgroup 1a: sources for which the respective coefficients in $D_{k,:}$ and $D_{\ell,:}$ are both 1; 
Denote the sum of these sources as $u_{++}$; 

• Sub-subgroup 1b: sources for which the respective coefficients in $D_{k,:}$ and $D_{\ell,:}$ are both 
−1; Denote the sum of these sources as $u_{--}$; 

• Sub-subgroup 1c: sources for which the respective coefficients in $D_{k,:}$ and $D_{\ell,:}$ are 1 and 
−1, resp.; Denote the sum of these sources as $u_{+-}$; 

• Sub-subgroup 1d: sources for which the respective coefficients in $D_{k,:}$ and $D_{\ell,:}$ are −1 and 
1, resp.; Denote the sum of these sources as $u_{-+}$; 

2) Sub-group 2: Sources included in $D_{k,:}$ but excluded from $D_{\ell,:}$. Denote the respective linear 
combination of these sources as $v_1$; 

3) Sub-group 3: Sources included in $D_{\ell,:}$ but excluded from $D_{k,:}$. Denote the respective linear 
combination of these sources as $v_2$. 



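For example, if (for $K = 6$) $D_{k,:} = [\,0\ \ 1\ \ {-1}\ \ 1\ \ 1\ \ 1\,]$ and $D_{\ell,:} = [\,{-1}\ \ {-1}\ \ 0\ \ 0\ \ 1\ \ 1\,]$, 
then $u_{++} = s_5 \oplus s_6$, $u_{--} = u_{-+} = 0$, $u_{+-} = s_2$, $v_1 = -s_3 \oplus s_4$ and $v_2 = -s_1$. 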
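
The GF(3) partition can be sketched along the same lines. The function name and the dictionary keys are hypothetical, and the two rows in the usage line are an assumed reading of the example, chosen to reproduce the groups quoted there.

```python
def partition_gf3(row_k, row_l):
    """Group 1-based source indices by the coefficient pair (row_k[i], row_l[i]).

    Coefficients are taken in {0, 1, -1}. For v1 and v2 the coefficient is kept
    with the index, since these are linear combinations rather than plain sums.
    """
    names = {(1, 1): "u++", (-1, -1): "u--", (1, -1): "u+-", (-1, 1): "u-+"}
    groups = {name: [] for name in list(names.values()) + ["v1", "v2"]}
    for i, (a, b) in enumerate(zip(row_k, row_l), start=1):
        if a and b:
            groups[names[(a, b)]].append(i)
        elif a:
            groups["v1"].append((a, i))
        elif b:
            groups["v2"].append((b, i))
    return groups

groups = partition_gf3([0, 1, -1, 1, 1, 1], [-1, -1, 0, 0, 1, 1])
```

Here `groups["u++"]` is `[5, 6]`, `groups["u+-"]` is `[2]`, `groups["v1"]` is `[(-1, 3), (1, 4)]` and `groups["v2"]` is `[(-1, 1)]`, while the remaining two sub-subgroups are empty.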

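The random variables $u_{++}$, $u_{--}$, $u_{+-}$, $u_{-+}$, $v_1$ and $v_2$ are statistically independent. Their respective 
probabilities vectors and characteristic vectors are denoted 

$$\mathbf{p}_\nu = \begin{bmatrix} p_\nu(0) \\ p_\nu(1) \\ p_\nu(2) \end{bmatrix}, \qquad \tilde{\mathbf{p}}_\nu = \begin{bmatrix} 1 \\ \xi_\nu \\ \xi_\nu^* \end{bmatrix}, \quad \text{for } \nu = u_{++}, u_{--}, u_{+-}, u_{-+}, v_1, v_2. \tag{50}$$

An expression for $\xi_\nu = E[W^\nu]$ in terms of $p_\nu(0)$, $p_\nu(1)$ and $p_\nu(2)$ was given above. Note 
further, that $\xi_{-\nu} = \xi_\nu^*$, so that $\tilde{\mathbf{p}}_{-\nu} = \tilde{\mathbf{p}}_\nu^*$. 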
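
The conjugation property of the characteristic parameter can be checked numerically. The sketch below assumes W = exp(j2π/3); the sign convention of the exponent is an assumption here and does not affect the property being checked. The helper names are hypothetical.

```python
import cmath

W = cmath.exp(2j * cmath.pi / 3)   # assumed convention for the primitive cube root of unity

def xi(p):
    """xi = E[W^x] for x in GF(3) with PMF p = [p(0), p(1), p(2)]."""
    return p[0] + p[1] * W + p[2] * W ** 2

def negate(p):
    """PMF of -x (mod 3): the probabilities of the values 1 and 2 are swapped."""
    return [p[0], p[2], p[1]]
```

For any PMF `p`, `xi(negate(p))` equals `xi(p).conjugate()` up to rounding, and `xi([1/3, 1/3, 1/3])` vanishes, which is why non-uniform variables are guaranteed... only in GF(2) and GF(3)... to have a non-zero parameter is argued in the main text; the code merely evaluates the definition.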
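Evidently, 

$$y_k = v_1 \oplus u_{++} \ominus u_{--} \oplus u_{+-} \ominus u_{-+}, \qquad y_\ell = v_2 \oplus u_{++} \ominus u_{--} \ominus u_{+-} \oplus u_{-+}, \tag{51}$$

so their characteristic vectors are given by 

$$\begin{aligned} \tilde{\mathbf{p}}_{y_k} &= \tilde{\mathbf{p}}_{v_1} \odot \tilde{\mathbf{p}}_{u_{++}} \odot \tilde{\mathbf{p}}_{u_{--}}^* \odot \tilde{\mathbf{p}}_{u_{+-}} \odot \tilde{\mathbf{p}}_{u_{-+}}^* \\ \tilde{\mathbf{p}}_{y_\ell} &= \tilde{\mathbf{p}}_{v_2} \odot \tilde{\mathbf{p}}_{u_{++}} \odot \tilde{\mathbf{p}}_{u_{--}}^* \odot \tilde{\mathbf{p}}_{u_{+-}}^* \odot \tilde{\mathbf{p}}_{u_{-+}} \end{aligned} \tag{52}$$

The random vector $\mathbf{w} = [y_k \ y_\ell]^T$ can now be expressed as the sum of six independent random 
vectors: 

$$\mathbf{w} = \begin{bmatrix} v_1 \\ 0 \end{bmatrix} \oplus \begin{bmatrix} 0 \\ v_2 \end{bmatrix} \oplus \begin{bmatrix} u_{++} \\ u_{++} \end{bmatrix} \oplus \begin{bmatrix} -u_{--} \\ -u_{--} \end{bmatrix} \oplus \begin{bmatrix} u_{+-} \\ -u_{+-} \end{bmatrix} \oplus \begin{bmatrix} -u_{-+} \\ u_{-+} \end{bmatrix} = \mathbf{v}_1 \oplus \mathbf{v}_2 \oplus \mathbf{u}_{++} \oplus \mathbf{u}_{--} \oplus \mathbf{u}_{+-} \oplus \mathbf{u}_{-+}. \tag{53}$$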
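The probabilities matrices of the vectors $\mathbf{v}_1$, $\mathbf{v}_2$, $\mathbf{u}_{++}$, $\mathbf{u}_{--}$, $\mathbf{u}_{+-}$ and $\mathbf{u}_{-+}$, and their respective 
characteristic matrices are given by 

$$\mathbf{P}_{\mathbf{v}_1} = \begin{bmatrix} p_{v_1}(0) & 0 & 0 \\ p_{v_1}(1) & 0 & 0 \\ p_{v_1}(2) & 0 & 0 \end{bmatrix} \ \Rightarrow\ \tilde{\mathbf{P}}_{\mathbf{v}_1} = \begin{bmatrix} 1 & 1 & 1 \\ \xi_{v_1} & \xi_{v_1} & \xi_{v_1} \\ \xi_{v_1}^* & \xi_{v_1}^* & \xi_{v_1}^* \end{bmatrix} \tag{54a}$$

$$\mathbf{P}_{\mathbf{v}_2} = \begin{bmatrix} p_{v_2}(0) & p_{v_2}(1) & p_{v_2}(2) \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix} \ \Rightarrow\ \tilde{\mathbf{P}}_{\mathbf{v}_2} = \begin{bmatrix} 1 & \xi_{v_2} & \xi_{v_2}^* \\ 1 & \xi_{v_2} & \xi_{v_2}^* \\ 1 & \xi_{v_2} & \xi_{v_2}^* \end{bmatrix} \tag{54b}$$

$$\mathbf{P}_{\mathbf{u}_{++}} = \begin{bmatrix} p_{u_{++}}(0) & 0 & 0 \\ 0 & p_{u_{++}}(1) & 0 \\ 0 & 0 & p_{u_{++}}(2) \end{bmatrix} \ \Rightarrow\ \tilde{\mathbf{P}}_{\mathbf{u}_{++}} = \begin{bmatrix} 1 & \xi_{u_{++}} & \xi_{u_{++}}^* \\ \xi_{u_{++}} & \xi_{u_{++}}^* & 1 \\ \xi_{u_{++}}^* & 1 & \xi_{u_{++}} \end{bmatrix} \tag{55a}$$

$$\mathbf{P}_{\mathbf{u}_{--}} = \begin{bmatrix} p_{u_{--}}(0) & 0 & 0 \\ 0 & p_{u_{--}}(2) & 0 \\ 0 & 0 & p_{u_{--}}(1) \end{bmatrix} \ \Rightarrow\ \tilde{\mathbf{P}}_{\mathbf{u}_{--}} = \begin{bmatrix} 1 & \xi_{u_{--}}^* & \xi_{u_{--}} \\ \xi_{u_{--}}^* & \xi_{u_{--}} & 1 \\ \xi_{u_{--}} & 1 & \xi_{u_{--}}^* \end{bmatrix} \tag{55b}$$

$$\mathbf{P}_{\mathbf{u}_{+-}} = \begin{bmatrix} p_{u_{+-}}(0) & 0 & 0 \\ 0 & 0 & p_{u_{+-}}(1) \\ 0 & p_{u_{+-}}(2) & 0 \end{bmatrix} \ \Rightarrow\ \tilde{\mathbf{P}}_{\mathbf{u}_{+-}} = \begin{bmatrix} 1 & \xi_{u_{+-}}^* & \xi_{u_{+-}} \\ \xi_{u_{+-}} & 1 & \xi_{u_{+-}}^* \\ \xi_{u_{+-}}^* & \xi_{u_{+-}} & 1 \end{bmatrix} \tag{55c}$$

$$\mathbf{P}_{\mathbf{u}_{-+}} = \begin{bmatrix} p_{u_{-+}}(0) & 0 & 0 \\ 0 & 0 & p_{u_{-+}}(2) \\ 0 & p_{u_{-+}}(1) & 0 \end{bmatrix} \ \Rightarrow\ \tilde{\mathbf{P}}_{\mathbf{u}_{-+}} = \begin{bmatrix} 1 & \xi_{u_{-+}} & \xi_{u_{-+}}^* \\ \xi_{u_{-+}}^* & 1 & \xi_{u_{-+}} \\ \xi_{u_{-+}} & \xi_{u_{-+}}^* & 1 \end{bmatrix} \tag{55d}$$

Thus, the characteristic matrix of $\mathbf{w}$ is given by the Hadamard product of these matrices, 

$$\tilde{\mathbf{P}}_{\mathbf{w}} = \tilde{\mathbf{P}}_{\mathbf{v}_1} \odot \tilde{\mathbf{P}}_{\mathbf{v}_2} \odot \tilde{\mathbf{P}}_{\mathbf{u}_{++}} \odot \tilde{\mathbf{P}}_{\mathbf{u}_{--}} \odot \tilde{\mathbf{P}}_{\mathbf{u}_{+-}} \odot \tilde{\mathbf{P}}_{\mathbf{u}_{-+}}. \tag{56}$$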
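
The structure of the characteristic matrices of the four "common" vectors can be verified by direct evaluation of the definition. The sketch below assumes W = exp(j2π/3) and defines entry (m, n) as E[W^(m·w1 + n·w2)]; both the convention and the function name are assumptions made for illustration.

```python
import cmath

W = cmath.exp(2j * cmath.pi / 3)   # assumed convention for the primitive cube root of unity

def char_matrix(p, a, b):
    """Characteristic matrix of the vector [a*u, b*u] (mod 3), with u ~ p and a, b in {1, -1}.

    Entry (m, n) is E[W^(m*a*u + n*b*u)], for m, n in {0, 1, 2}.
    """
    return [[sum(p[x] * W ** ((m * a + n * b) * x % 3) for x in range(3))
             for n in range(3)] for m in range(3)]
```

For example, `char_matrix(p, 1, -1)` has ones on its diagonal, with ξ and its conjugate alternating off the diagonal, while `char_matrix(p, 1, 1)` has ones on its anti-diagonal positions (2, 3) and (3, 2).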



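Now, if $y_k$ and $y_\ell$ are statistically independent, then $\tilde{\mathbf{P}}_{\mathbf{w}}$ is also given by the outer product of their 
characteristic vectors, which, using (52), is given by 

$$\tilde{\mathbf{P}}_{\mathbf{w}} = \tilde{\mathbf{p}}_{y_k} \tilde{\mathbf{p}}_{y_\ell}^T = \left(\tilde{\mathbf{p}}_{v_1} \tilde{\mathbf{p}}_{v_2}^T\right) \odot \left(\tilde{\mathbf{p}}_{u_{++}} \tilde{\mathbf{p}}_{u_{++}}^T\right) \odot \left(\tilde{\mathbf{p}}_{u_{--}}^* \tilde{\mathbf{p}}_{u_{--}}^H\right) \odot \left(\tilde{\mathbf{p}}_{u_{+-}} \tilde{\mathbf{p}}_{u_{+-}}^H\right) \odot \left(\tilde{\mathbf{p}}_{u_{-+}}^* \tilde{\mathbf{p}}_{u_{-+}}^T\right), \tag{57}$$

where $(\cdot)^H$ denotes the conjugate transpose. Noting that $\tilde{\mathbf{p}}_{v_1} \tilde{\mathbf{p}}_{v_2}^T = \tilde{\mathbf{P}}_{\mathbf{v}_1} \odot \tilde{\mathbf{P}}_{\mathbf{v}_2}$, and recalling that, 
since $v_1$ and $v_2$ cannot be uniform, $\xi_{v_1}$ and $\xi_{v_2}$ must be non-zero, we conclude that the independence 
of $y_k$ and $y_\ell$ implies that 

$$\left(\tilde{\mathbf{p}}_{u_{++}} \tilde{\mathbf{p}}_{u_{++}}^T\right) \odot \left(\tilde{\mathbf{p}}_{u_{--}}^* \tilde{\mathbf{p}}_{u_{--}}^H\right) \odot \left(\tilde{\mathbf{p}}_{u_{+-}} \tilde{\mathbf{p}}_{u_{+-}}^H\right) \odot \left(\tilde{\mathbf{p}}_{u_{-+}}^* \tilde{\mathbf{p}}_{u_{-+}}^T\right) = \tilde{\mathbf{P}}_{\mathbf{u}_{++}} \odot \tilde{\mathbf{P}}_{\mathbf{u}_{--}} \odot \tilde{\mathbf{P}}_{\mathbf{u}_{+-}} \odot \tilde{\mathbf{P}}_{\mathbf{u}_{-+}}. \tag{58}$$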

It is easy to observe that the first row and first column of each of the matrices on the left-hand side 
(LHS) are indeed always identical to those of the respective matrices on the right-hand side (RHS), 
regardless of the values of the $\xi$ parameters. In addition, in each of the matrices the (2, 2) element 
is the conjugate of the (3, 3) element, and the (2, 3) element is the conjugate of the (3, 2) element. 
Therefore, the independence of $y_k$ and $y_\ell$ merely implies the equality of the products of the (2, 2) 
elements on the LHS and on the RHS, and of the products of the (2, 3) elements on the LHS and 
on the RHS. 

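The equality of the product of the (2, 2) elements implies 

$$\xi_{u_{++}}^* \cdot \xi_{u_{--}} \cdot 1 \cdot 1 = \left(\xi_{u_{++}}\right)^2 \cdot \left(\xi_{u_{--}}^*\right)^2 \cdot |\xi_{u_{+-}}|^2 \cdot |\xi_{u_{-+}}|^2, \tag{59a}$$

and the equality of the product of the (2, 3) elements implies 

$$1 \cdot 1 \cdot \xi_{u_{+-}}^* \cdot \xi_{u_{-+}} = |\xi_{u_{++}}|^2 \cdot |\xi_{u_{--}}|^2 \cdot \left(\xi_{u_{+-}}\right)^2 \cdot \left(\xi_{u_{-+}}^*\right)^2. \tag{59b}$$

Taking the absolute values of both, and recalling that since neither of the random variables $u_{++}$, 
$u_{--}$, $u_{+-}$ and $u_{-+}$ can be uniform, neither of the $\xi$ parameters can be zero, we have 

$$\begin{aligned} |\xi_{u_{++}}| \cdot |\xi_{u_{--}}| &= |\xi_{u_{++}}|^2 \cdot |\xi_{u_{--}}|^2 \cdot |\xi_{u_{+-}}|^2 \cdot |\xi_{u_{-+}}|^2 \ \Rightarrow\ |\xi_{u_{++}}| \cdot |\xi_{u_{--}}| \cdot |\xi_{u_{+-}}|^2 \cdot |\xi_{u_{-+}}|^2 = 1 \\ |\xi_{u_{+-}}| \cdot |\xi_{u_{-+}}| &= |\xi_{u_{++}}|^2 \cdot |\xi_{u_{--}}|^2 \cdot |\xi_{u_{+-}}|^2 \cdot |\xi_{u_{-+}}|^2 \ \Rightarrow\ |\xi_{u_{++}}|^2 \cdot |\xi_{u_{--}}|^2 \cdot |\xi_{u_{+-}}| \cdot |\xi_{u_{-+}}| = 1 \end{aligned} \tag{60}$$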



Since for any random variable $\nu$ in GF(3), $|\xi_\nu| \le 1$ with equality iff $\nu$ is degenerate, we conclude 
from (60) that if $y_k$ and $y_\ell$ are independent, then $u_{++}$, $u_{--}$, $u_{+-}$ and $u_{-+}$ must all be degenerate. 
Since none of the independent sources is degenerate, this implies, in turn, that all four are identically 
zero, and that there are no non-zero elements common to $D_{k,:}$ and $D_{\ell,:}$. 
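
The bound on the characteristic parameter invoked above follows from the triangle inequality. A short derivation, assuming only the definition of the parameter as the expectation of $W^\nu$ with $W$ a primitive cube root of unity:

```latex
|\xi_\nu| = \Bigl|\sum_{n=0}^{2} p_\nu(n)\,W^{n}\Bigr|
\;\le\; \sum_{n=0}^{2} p_\nu(n)\,\bigl|W^{n}\bigr|
\;=\; \sum_{n=0}^{2} p_\nu(n) \;=\; 1 .
```

Equality requires all non-zero terms $p_\nu(n) W^n$ to share the same phase. Since $1$, $W$ and $W^2$ have three distinct phases, this is possible only when a single $p_\nu(n)$ is non-zero, namely when $\nu$ is degenerate.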

Like in the GF(2) case, by repeated application of this result to all row-couples in $D$, we conclude 
that pairwise independence of the elements of $y$ implies that $D$ is (up to signs) a permutation matrix, 
namely that the elements of $y$ are fully mutually independent. 